Preface


The 2017 SIS Conference aims to highlight the crucial role of Statistics in Data
Science. In this new domain, where “meaning” is extracted from data, the increasing
amount of data produced and made available in databases has brought new
challenges. These challenges involve different fields: statistics, machine learning,
information and computer science, optimization and pattern recognition. Together,
these fields make a considerable contribution to the analysis of “Big data”, open data,
relational and complex data, both structured and unstructured. The aim is to collect
contributions from the different domains of Statistics on high-dimensional data
quality validation, sample extraction, dimensionality reduction, pattern selection,
data modelling, hypothesis testing and the confirmation of conclusions drawn from
the data. In line with the remark that statistics is the “grammar of data science”,
statistics has become a basic skill in data science: it gives the right meaning to the
data. Moreover, it is not replaced by newer techniques from machine learning and
other disciplines; rather, it complements them. The Conference also addresses the
challenges facing the new generations: the digital natives, who are called to develop
professional skills as “data analysts”, one of the most sought-after professions of the
21st century, crossing rigid disciplinary boundaries. From this perspective, all
traditional statistical topics are admitted, extended to the related topics in machine
learning and computer science. The present volume includes the short papers of the
contributions that will be presented in the 4 invited speaker sessions, the 19
specialized sessions, the 11 solicited sessions, the 6 foreign societies sessions and
the 17 contributed sessions, as well as in the panel session.
President of the Scientific Programme Committee
Determination of basis risk multiplier of a
borrower default using survival analysis


Determinazione del moltiplicatore di rischio di base di 
un default mutuatario attraverso un’analisi di 
sopravvivenza 


Alexander Agapitov, Irina Lackman, Zoya Maksimenko 


Abstract This research aims to identify the predictors affecting the size of the
basis risk multiplier of a loan default for a certain period. The calculations are
based on survival models (Cox proportional hazards models) that take into account
a grouping variable for the reliability rating of borrowers. The research uses loan
data from the Californian company Lending Club, which is engaged in peer-to-peer
lending. The object of the research is the borrower, for whom the risk of default
occurring by a certain period is predicted.
Abstract Lo scopo dell’analisi è quello di individuare gli elementi che influenzano
il valore dei moltiplicatori del rischio base relativamente al mancato pagamento del 
prestito per un certo periodo. L’analisi è condotta mediante dei modelli di soprav- 
vivenza (modelli a rischi proporzionali; modelli di Cox), tenendo conto del gruppo 
di reputazione e dell’affidabilità dei debitori. Nella ricerca effettuata sono stati uti- 
lizzati i dati sui prestiti di una società californiana Lending Club, che si occupa 
della concessione del credito. L’oggetto della ricerca era il debitore, per il quale è 
stato determinato il rischio di insolvenza ad un certo periodo. 


Key words: survival analysis, Kaplan-Meier estimator, Cox proportional hazards 
model 
1 Introduction


Lending to individuals requires consideration of all the risks that could lead the
borrower to default on his or her obligations. In the context of an economic crisis,
the problem of the indebtedness of the population becomes extremely urgent. In this
regard, it is important to be able to correctly identify the multipliers of the basis
risk of borrower default for a certain period, taking into account both the
characteristics of the borrower and the features of the credit product. Survival
analysis may serve as one of the tools for solving such problems.

The main advantage of survival analysis over other credit scoring models is its
ability to work with right-censored data, i.e. cases in which the event (the default)
is not observed for the object during the observation period (for example, the
borrower has already paid back the loan in full or is still paying it at the end of
the observation period). Another advantage of survival models is that they make it
possible not only to assess whether the borrower will repay the loan, but also to
estimate the time during which the loan obligations are fulfilled in good faith.
Here, this time is treated as the conventional “survival” of the borrower for the
Bank, i.e. we consider that the borrower has “died” for the Bank if he/she ceases to
pay the loan installments on time.

There are many studies using the survival analysis to estimate the default of 
banks: 


1. Survival analysis of private Banks in Brazil in the period of 1994-2007. [1] 

2. Bank failure prediction: a two-step survival time approach (the joint research of 
the University of Vienna and the National Bank of Austria) [3] 

3. Start-up banks default and the role of capital (the research of the Bank of Italy 
according to the data of 1994-2006) [4] 


There are also studies where the survival analysis is applied to examine the bor- 
rower’s default: 


1. Survival analysis methods for personal loan data [6] 
2. Credit scoring with macroeconomic variables using the survival analysis [2] 
3. Survival analysis in credit scoring: A framework for PD estimation [5] 


2 Data 


In our research we used loan data from the Californian company Lending Club, one of
the largest peer-to-peer lenders in the USA. The summary table on loans was taken
from the Kaggle portal, a platform that hosts data analysis competitions. The
sample consists of 887,379 observations for the period 2007-2015. For this
research we used 36-month loans; the final dataset comprises 602,871 loans.
The risk of the event (default) occurring was predicted for the borrower, who was the
object of observation. While under observation, the borrower belonged to the credit
risk group; at any point in time an event may occur that causes the borrower to leave
the risk group. The observation period starts when the borrower takes out a loan and
finishes when the borrower defaults.

The independent variables (predictors) are characteristics of an object that may
influence the risk of the event occurring. We used the following predictors: interest
rate on the loan, employment length, annual income, the borrower's region of
residence, home ownership, credit history (the earliest credit line), loan amount,
loan purpose and the financial reliability of the borrower, calculated by Lending Club
on a scale from A to D, where A is the best possible grade and D the worst.


3 Survival analysis 


At the first stage of the study, the Kaplan-Meier method was used to identify the
borrower characteristics that are predictors of the fulfillment of loan obligations
in good faith. The survival curves obtained from the Kaplan-Meier estimates showed
that, for most predictors, there are significant differences between the groups.
Thus, it can be concluded that survival models are appropriate for solving these
problems.
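
The Kaplan-Meier step can be illustrated with a minimal Python sketch. It is not the
authors' code: the function name kaplan_meier and the toy durations and default flags
are hypothetical, and only the standard product-limit formula is assumed, in which the
survival estimate is multiplied by (1 - d/n) at each default time, with d the number
of defaults at that time and n the number of loans still at risk.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct default time.

    durations: months each loan was observed
    events:    1 if the loan defaulted at that time, 0 if right-censored
    """
    defaults = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # defaults and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = defaults.get(t, 0)
        if d > 0:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: ten 36-month loans, some censored (event = 0).
durations = [3, 5, 5, 8, 12, 12, 20, 36, 36, 36]
events = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
km_curve = kaplan_meier(durations, events)
```

Computing one such curve per level of a predictor (for example, one per home
ownership category) gives the group-wise survival functions that are compared
graphically at this stage.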

At the second stage of the analysis, tests were carried out with the null hypothesis
that survival does not differ between groups: the Mantel-Haenszel log-rank test and
the Gehan-Wilcoxon test. The value of the test statistic, the degrees of freedom and
the significance level for every borrower characteristic and each test are presented
in Table 1.


Table 1 Survival analysis: tests


                      Log-rank test                        Gehan’s Generalized Wilcoxon
Variable              χ² statistic  Degrees of  p-value    χ² statistic  Degrees of  p-value
                                    freedom                              freedom

Home ownership        968           2           0          987           2           0
Earliest credit line  742           2           0          749           2           0
Interest rate         10887         3           0          11036         3           0
Annual income         2097          6           0          2121          6           0
Funded Amount         183           4           0          196           4           0
Employment length     499           3           0          512           5           0
Region of the US      130           8           0          131           8           0
Credit purpose        1439          8           0          1395          8           0
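
The group comparison behind Table 1 can be sketched for the simplest case of two
groups. This is an illustrative Python sketch, not the authors' implementation: the
function name logrank_statistic and the toy data are hypothetical, and the tests in
Table 1 compare more than two groups (hence their larger degrees of freedom). With
gehan=True each event time is weighted by the number of loans at risk, which is the
weighting used by Gehan's generalized Wilcoxon test.

```python
def logrank_statistic(times_a, events_a, times_b, events_b, gehan=False):
    """Two-group chi-square statistic with 1 degree of freedom.

    gehan=False gives the log-rank test; gehan=True weights each event time
    by the number at risk (Gehan's generalized Wilcoxon test).
    """
    data = [(t, e, "a") for t, e in zip(times_a, events_a)]
    data += [(t, e, "b") for t, e in zip(times_b, events_b)]
    score = 0.0
    variance = 0.0
    for t in sorted({t for t, e, _ in data if e == 1}):
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == "a")  # at risk, group a
        d = sum(1 for u, e, _ in data if u == t and e == 1)      # defaults at t
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == "a")
        w = n if gehan else 1
        score += w * (d_a - d * n_a / n)  # observed minus expected defaults in a
        if n > 1:
            variance += w ** 2 * d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    return score ** 2 / variance if variance > 0 else 0.0


# Hypothetical toy data for two groups of borrowers.
t_a, e_a = [6, 12, 20, 36, 36], [1, 0, 1, 0, 0]
t_b, e_b = [3, 5, 8, 12, 36], [1, 1, 1, 1, 0]
stat_logrank = logrank_statistic(t_a, e_a, t_b, e_b)
stat_gehan = logrank_statistic(t_a, e_a, t_b, e_b, gehan=True)
```

A large statistic relative to the chi-square distribution leads to rejecting the
null hypothesis that the groups share the same survival function.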


Test results showed a statistically significant difference between the groups for
each variable. Thus, it was concluded that all the borrower predictors should be used
to build the Cox survival model. The Cox proportional hazards model was preferred
over other models following selection procedures based on the Akaike and Schwarz
information criteria.

At the third stage of the model construction, after a preliminary evaluation of the
pooled data, it was decided to estimate the proportional hazards model taking into
account the groups of “reliable” and “unreliable” clients identified by Lending
Club. Table 2 shows the multipliers relative to the basis risk, calculated using the
Cox model for “reliable” (Good) and “unreliable” (Bad) clients respectively.


Table 2: Survival analysis: Cox models


Variable              Level                 Good    Bad

Home ownership        Mortgage              0.899   0.905
                      Own                   0.922   0.946

Earliest credit line  from 1990 to 2000     1.115   1.041
                      after 2000            1.150   1.145

Interest rate         >10%                  1.864   -
                      From 15% to 20%       -       1.416
                      >20%                  -       2.009

Annual income         From 15 to 30         -       0.998
                      From 30 to 50         0.868   0.934
                      From 50 to 75         0.706   0.845
                      From 75 to 100        0.597   0.741
                      From 100 to 150       0.548   0.665
                      >150                  0.514   0.636

Funded Amount         From 5000 to 10000    1.049   1.228
                      From 10000 to 15000   1.084   1.332
                      From 15000 to 25000   1.160   1.431
                      >25000                1.204   1.487

Employment length     Less than 1 year      0.766   0.943
                      1 year                0.690   0.868
                      From 2 to 5 years     0.691   0.852
                      From 6 to 9 years     0.717   0.864
                      10 and more           0.699   0.798

Region of the US      Mountain              1.063   1.000
                      West North Central    0.992   0.946
                      East North Central    0.940   0.924
                      West South Central    0.966   0.926
                      East South Central    1.128   1.076
                      South Atlantic        1.050   0.983
                      Mid-Atlantic          1.062   0.957
                      New England           0.946   0.870

Purpose               Credit card           0.867   0.843
                      Major purchase        0.888   0.874
                      Other                 1.216   0.955
                      Car                   0.807   0.899
                      Medical               1.369   1.136
                      Small business        1.980   1.322
                      House                 1.044   1.134
                      Home improvement      1.043   0.978
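
To make the reading of the multipliers concrete: in a Cox proportional hazards model
the hazard is the baseline hazard times exp(coefficient) for each predictor level,
so each entry of Table 2 is such an exponentiated coefficient and entries for
different predictors combine by multiplication. The Python sketch below illustrates
this for a hypothetical borrower profile. The helper names are illustrative, the
values are copied from the Good column of Table 2, and the absence of interaction
terms is an assumption, since the paper does not describe the model specification in
that detail.

```python
import math

# Multipliers for "reliable" clients, copied from the Good column of Table 2.
GOOD_MULTIPLIERS = {
    ("Purpose", "Small business"): 1.980,
    ("Funded Amount", ">25000"): 1.204,
    ("Home ownership", "Mortgage"): 0.899,
}


def combined_multiplier(profile, table):
    """Multiply the multipliers of all predictor levels in the profile.

    Assumes a Cox model without interaction terms (an assumption here).
    """
    total = 1.0
    for key in profile:
        total *= table[key]
    return total


def coefficient(multiplier):
    """Cox regression coefficient implied by a multiplier (hazard ratio)."""
    return math.log(multiplier)


# Hypothetical profile: small business loan above 25,000 dollars.
profile = [("Purpose", "Small business"), ("Funded Amount", ">25000")]
risk_vs_baseline = combined_multiplier(profile, GOOD_MULTIPLIERS)
beta_small_business = coefficient(1.980)
```

A multiplier above 1 corresponds to a positive coefficient and a higher risk than
the baseline level; a multiplier below 1 corresponds to a negative coefficient.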


As a result of the analysis the following conclusions can be made for the
“reliable” clients:


1. The risk of debt at higher interest rates remains. The risk for borrowers with a
   high interest rate is 1.86 times higher than for borrowers with a low interest
   rate.

2. For “reliable” clients, the risk of debt at the same annual income is lower than
   in the general model.

3. The size of the loan affects the risk of debt less than other factors. A loan
   larger than 25 thousand dollars increases the risk 1.2 times compared to the
   baseline (in the general model this indicator increases the risk 1.5 times).

4. An interesting situation is observed for the “credit purpose” variable. The risk
   of debt of a borrower with a small business loan is 2 times higher than the basis
   risk. The risk of borrowers with credit for medical services also increases; it
   is 1.4 times higher than the basis risk.


4 Conclusions 


“Reliable” clients with high socio-economic indicators have a lower risk of debt
than in the general model. At the same time, borrowers who took credit for a small
business have much higher risks.


For “unreliable” borrowers the following risks can be identified:


1. Borrowers who live in an owner-occupied dwelling bear a risk 10% greater than
   borrowers living in rented accommodation.

2. The risk of debt for borrowers with interest rates ranging from 15 to 20 percent
   is 1.42 times greater than for borrowers with an interest rate of less than 15%.
   For customers who have a loan at an interest rate of more than 20%, the risk of
   debt is 2 times higher than the basis risk.

3. Clients with an annual income of less than $50 thousand carry the same high risk
   of debt.

4. The risk of debt for borrowers whose loan size exceeds 15 thousand dollars is
   1.45 times higher than the basis risk.

5. Borrowers living on the East Coast of the USA carry, on average, a lower risk of
   debt than the inhabitants of the West Coast and mountain states.

6. The risk of debt for borrowers who took credit for a small business is 32% higher
   than the basis risk. Borrowers with credit for medical services or real estate
   purchase have a risk that is higher by 3%. As in the other models, borrowers who
   took a loan to pay off credit card debt or to buy a car carry the least risk.


In conclusion, the following recommendation can be given: in order to reduce the
debt risk for “unreliable” customers, Lending Club should be more careful in
selecting customers who want to take out a large loan (more than 15 thousand
dollars). Borrowers with a high interest rate also carry a high risk of debt.
School principals’ leadership styles and students’
achievement: empirical results from a three-step
Latent Class Analysis

Stili di leadership dei Dirigenti Scolastici e apprendimenti 
degli studenti: risultati da una three-step Latent Class 
Analysis 


Tommaso Agasisti, Alex J. Bowers and Mara Soncin 


Abstract This study explores the existence of various leadership types in a sample of lower
secondary school principals across Italy (N=1,073). Information is derived from a questionnaire
provided by INVALSI (National Evaluation Committee for Education) about instructional
practices and leadership perceptions. Employing a Latent Class Analysis (LCA), we identify
three subgroups of school leaders. We then analyze whether principals’ individual
characteristics and school context factors are statistically associated with the probability of
adopting a given leadership style. Finally, we provide evidence that schools where
the principal adopts an “instructional” approach report lower academic test scores.
Abstract La presente ricerca ha lo scopo di indagare l’esistenza di diversi stili di 
leadership in un campione rappresentativo di scuole secondarie di primo grado italiane 
(N=1,073). Le informazioni sono tratte dal questionario INVALSI (Istituto Nazionale per la 
Valutazione del Sistema Educativo) rivolto ai Dirigenti Scolastici, in cui vengono testate le 
pratiche manageriali e di leadership implementate nelle scuole. Implementando una Latent 
Class Analysis (LCA), vengono identificati tre approcci alla leadership scolastica. 
Successivamente, viene analizzata la correlazione tra tali approcci ed una serie di 
caratteristiche di contesto e del Dirigente Scolastico in capo. Infine, l’analisi mostra come i 
Dirigenti Scolastici che adottano una leadership “educativa” riportano punteggi inferiori 
nei test standardizzati. 
1. Introduction and existing literature


Research evidence has demonstrated the importance of school leadership in 
influencing students’ success in both cognitive and non-cognitive outcomes 
(Robinson et al., 2008, Waters et al., 2003). Among school factors, leadership is 
second only to classroom conditions in influencing achievement (Day et al., 2009, 
Leithwood et al., 2008). Looking for the existence of different leadership styles, 
literature has moved from the predominant role of instructional leadership (Smith & 
Andrews, 1989) to a more comprehensive vision of school leadership, with a 
growing emphasis on transformational, transactional and distributed leadership (Day 
et al., 2016, Marks & Printy, 2003, Urick & Bowers, 2014). Given the fact that the 
relationship between leadership styles and student achievement/engagement can be 
both direct and mediated by the role of teachers in the classroom or by school 
contextual conditions, the search for the most suitable model to measure this 
association is still fully open (e.g. Grissom et al., 2015). Moreover, school leaders 
are characterized by different attitudes and approaches in conducting managerial 
activities. Part of the literature on the topic is therefore devoted to exploring the extent to
which various managerial actions are actually adopted in the day-to-day life of school principals
(Bloom et al., 2015, Di Liberto et al., 2015). Moreover, the leadership style is not
only influenced by the managerial content of principal’s activities, but also by a set 
of contextual conditions and principals’ individual characteristics (i.e. mediator 
factors, Leithwood & Levin, 2005). The current study addresses these issues, aiming 
at identifying the existence of different leadership types in a sample of Italian school 
principals and establishing how the different types relate to student achievement. 
These objectives are pursued by applying a three-step Latent Class Analysis
(LCA), a statistical model that allows both for the identification of subgroups of
individuals within data and for relating these subgroups to a measured distal outcome
(e.g. Boyce & Bowers, 2016). More specifically, the research questions addressed are:

i. To what extent is there one or more than one subgroup (latent class) of
leadership types in national-level surveys of principals in Italy around
transformational and instructional leadership?

ii. Which are the main factors associated with the probability that a principal
belongs to a specific subgroup of respondents?

iii. To what extent is a typology of school leadership in Italy across
transformational and instructional leadership behaviors related to student
achievement on standardized tests?

This research is particularly innovative in the Italian context, where studies about 
leadership styles and managerial practices at school are still in an early stage (Bloom 
et al., 2015, Di Liberto et al., 2015). Moreover, the topic is particularly interesting in 
the policy context, given the approval of a law that empowers the role of school 
principals starting from 2015/16 school year (law 107/2015). 

The paper is organised as follows: Section 2 describes the data and methodology,
Section 3 presents the results obtained and Section 4 discusses them and
concludes.
2. Data and Methods


Data used in the study are provided by the National Evaluation Committee for
Education (INVALSI), which yearly assesses the competencies of Italian students in
reading and mathematics at given grades. The current data refer to grade 8, the last year of
lower secondary school, and to the school year 2014/15. The test is taken at national
level, but every year a set of schools is randomly chosen throughout the country to
be part of the National Sample (NS), where the assessment is monitored by external
evaluators. In the 2014/15 wave, the NS is composed of 28,494 students across 1,405
schools. In addition, since the 2013/14 school year, NS principals have also been asked to
fill in a questionnaire about their schools and the way they manage the organisation. Two
sections of the questionnaire are of primary importance for the current analysis: the
one reporting the managerial practices used in the school and the one containing principals’
characteristics. The questions posed to principals are in line with those
contained in OECD TALIS 2008 and 2013 (OECD, 2010, 2014). This part of the questionnaire is
composed of two groups of questions: the first concerns the frequency of application
of a set of instructional leadership practices, with a total of 12 sub-questions
and a four-category Likert scale as the response type. The second
proposes a list of statements (11 items in total) about the leadership role of the
principal in the school. The responses to the 23 items have been dichotomized to better
fit the purpose of the LCA. Descriptive statistics for these items are listed in Table 1.
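
The dichotomization can be sketched in Python as follows. The paper does not report
how the four response categories were coded or where the cut was placed, so the 1 to
4 coding, the threshold and all function names below are assumptions made only for
illustration. The same positive shares can also be used to flag near-constant items;
the “Agree” shares are those printed in Table 1 for items 13 to 23.

```python
def dichotomize(responses, threshold=3):
    """Map 1-4 Likert codes to 0/1; codes >= threshold count as positive.

    The 1-4 coding and the threshold are assumptions for illustration.
    """
    return [1 if r >= threshold else 0 for r in responses]


def share_positive(binary):
    """Proportion of respondents giving the positive answer."""
    return sum(binary) / len(binary)


def polarized_items(shares, cutoff=0.99):
    """Items whose positive share is at least cutoff or at most 1 - cutoff."""
    return [item for item, s in shares.items()
            if s >= cutoff or s <= 1.0 - cutoff]


# Hypothetical raw answers of eight principals to one item.
raw = [1, 2, 4, 3, 3, 2, 4, 4]
binary = dichotomize(raw)
share = share_positive(binary)

# "Agree" shares for items 13-23 as printed in Table 1.
agree_share = {13: 0.94, 14: 0.42, 15: 0.25, 16: 0.98, 17: 0.99, 18: 0.87,
               19: 0.99, 20: 0.96, 21: 0.64, 22: 0.99, 23: 0.06}
flagged = polarized_items(agree_share)
```

With a 99% cutoff applied to the rounded shares, three items of the second question
are flagged, which is the number of items the analysis reports removing because of
polarized answers.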

Merging the principal’s questionnaire with the students’ results, the final sample size 
is 1,073 schools. 

The approach used in the study is a Latent Class Analysis (LCA), a statistical method
from mixture modelling that makes it possible to verify the existence of different subgroups
within data (Muthén & Muthén, 2000, Muthén, 2004). The current model has been
run using Mplus version 7.4.

Figure 1 reports the model employed. Indicators are the only factors taken into
account when investigating the existence of different subgroups across data (step one
of the analysis); in the current analysis, they consist of the two questions posed to
principals about leadership practices. Looking at the descriptive statistics in Table 1,
it can be noticed that the question concerning the role of the school leader (second
half of the table) tends to show a higher level of polarization towards what is
considered the positive answer (high level of agreement). In order to deal with the
trade-off between the number of indicators and the variability across answers, we
delete only the three items with a 99%-1% polarization of answers, obtaining a
final set of 20 indicators. Covariates are then used to characterise the
individuals belonging to each group. Being added at step two, none of these factors
is able to influence the definition of the groups (which takes place at step one).
Instead, they help to explain group differences, stating how much more or less likely the
individuals belonging to a group are to report a specific characteristic (step two of
the analysis). Finally, distal outcomes are used as the outputs of the model and are
defined as factors possibly affected by an individual's membership of a class. In
other terms, the aim is to identify whether there is any statistical difference across
groups in the outcome measured (step three of the analysis).
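
Step one of the procedure, the estimation of latent classes from binary indicators,
can be illustrated with a small expectation-maximization (EM) routine in Python.
This is a minimal sketch under stated assumptions and not the Mplus estimation used
in the study: the function name fit_lca, the starting values, the number of
iterations and the toy response patterns are all hypothetical, and steps two and
three (covariates and distal outcomes) are not covered.

```python
import math
import random


def fit_lca(data, n_classes, n_iter=200, seed=0):
    """Fit a latent class model with binary indicators by EM.

    Returns (class_weights, item_probs, log_likelihood_trace).
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    weights = [1.0 / n_classes] * n_classes
    # Random starting values for P(item = 1 | class), kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    trace = []
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities per respondent.
        posteriors = []
        log_lik = 0.0
        for row in data:
            joint = []
            for k in range(n_classes):
                p = weights[k]
                for j, x in enumerate(row):
                    p *= probs[k][j] if x == 1 else 1.0 - probs[k][j]
                joint.append(p)
            total = sum(joint)
            log_lik += math.log(total)
            posteriors.append([p / total for p in joint])
        trace.append(log_lik)
        # M-step: update class sizes and item-response probabilities.
        for k in range(n_classes):
            mass = sum(post[k] for post in posteriors)
            weights[k] = mass / len(data)
            for j in range(n_items):
                hits = sum(post[k] * row[j]
                           for post, row in zip(posteriors, data))
                # Clamp so that no probability reaches exactly 0 or 1.
                probs[k][j] = min(max(hits / mass, 1e-6), 1.0 - 1e-6)
    return weights, probs, trace


# Hypothetical dichotomized answers of twelve principals to four items.
toy = ([[1, 1, 1, 0]] * 5 + [[0, 0, 1, 1]] * 4
       + [[1, 1, 0, 0], [0, 1, 1, 1], [1, 0, 1, 0]])
class_weights, item_probs, ll_trace = fit_lca(toy, n_classes=2)
```

The estimated class weights play the role of the group shares, and each row of
item_probs gives the profile of one class across the indicators.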
Table 1. Descriptive statistics of questions about instructional leadership.

Frequency of use of managerial practices                                      Seldom    Often

1. I make sure that teachers’ professional development activities are in
   line with the school’s educational objectives.                             16%       84%
2. I make sure that teachers work in conformity with school educational
   objectives.                                                                6%        94%
3. I observe educational activities in the classrooms.                        46%       54%
4. I provide teachers with suggestions for improving their teaching
   effectiveness.                                                             51%       49%
5. I supervise students’ works.                                               73%       27%
6. When a teacher has a problem in the classroom, I take the initiative to
   discuss with him/her about it.                                             [illegible]
7. I inform teachers on opportunities of disciplinary and educational
   update.                                                                    3%        97%
8. I encourage work which is goal oriented and/or based on the Formative
   Offer Plan.                                                                7%        93%
9. I take into account test scores when I make decisions on the school
   curriculum.                                                                26%       74%
10. I make sure that responsibilities on the coordination of the school
    curriculum are clearly defined.                                           14%       86%
11. I deal with bothering behaviors in the classes.                           18%       82%
12. I substitute teachers unexpectedly absent.                                77%       23%

Opinions about their leadership role                                          Disagree  Agree

13. In my job, it is important to make sure that educational strategies,
    approved by the Ministry, are explained to new teachers and applied by
    more experienced teachers.                                                6%        94%
14. The use of students' test scores in order to evaluate the teacher's
    performance reduces the value of his/her professional judgment.           58%       42%
15. Giving teachers a high degree of freedom in choosing the educational
    techniques can reduce teaching effectiveness.                             75%       25%
16. In my job, it is important to make sure that teachers’ skills are
    improving continuously.                                                   2%        98%
17. In my job, it is important to make sure that teachers feel responsible
    for the achievement of school objectives.                                 1%        99%
18. In my job, it is important to be convincing when presenting new
    projects to parents.                                                      13%       87%
19. It is important for the school to verify that rules are respected by
    everybody.                                                                1%        99%
20. It is important for the school to avoid mistakes in administrative
    procedures.                                                               5%        96%
21. In my job, it is important to solve timetable problems and/or lesson
    scheduling problems.                                                      36%       64%
22. It is important that I contribute to maintain a good school climate.      1%        99%
23. I have no possibility to know whether teachers are well performing
    their teaching tasks or not.                                              94%       6%


Figure 1. Statistical and Conceptual Model of the Latent Class Analysis (LCA) of
Principal Leadership Styles.

[Figure: indicators (frequency of application of managerial practices; perceptions about the
principal's leadership role); covariates, i.e. context factors (school context and composition;
school principal's characteristics); distal outcomes, i.e. students’ achievement (mathematics
test score; reading test score).]


3. Results 


3.1. Baseline results: groups of leadership types, individual 
characteristics and difference in student achievement 


Applying the analysis to the INVALSI data on 1,073 school principals across
Italy, we identify three different groups of leadership styles. Table 2 reports the main
fit statistics leading to this finding. The BIC and the LMR test agree
on the number of classes to be considered: the BIC starts to increase when the
number of classes reaches four (from 16,925.6 to 16,988.7),
whilst the LMR test is no longer significant for the four-class model
(p-value=0.1387) (Lo et al., 2001, Muthén & Asparouhov, 2006). In addition,
the entropy of the model stands at 0.707 (on a scale up to 1) and the Akaike Information
Criterion (AIC) is equal to 16,616.9. To test for the existence of a local minimum in
the best number of classes, we re-estimate the model with an additional number of
groups. Results from both the BIC and the LMR test continue to confirm that three is the
best number of classes for grouping school principals.


Table 2. LCA results and fit statistics.


Classes  AIC       BIC       Log-likelihood  LMR    p       Entropy
2        16,851.2  17,055.3  -8,875.0        974.1  0.0000  0.677
3        16,616.9  16,925.6  -8,246.6        276.3  0.0000  0.707
4        16,575.5  16,988.7  -8,204.7        82.9   0.1387  0.731
5        16,552.2  17,069.9  -8,204.7        64.8   0.4366  0.688


Note: AIC=Akaike information criterion; BIC=Bayesian information criterion;
LMR=Lo-Mendell-Rubin likelihood ratio test.
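
The information criteria in Table 2 follow the standard definitions AIC = -2 logL + 2p
and BIC = -2 logL + p ln(N), where p is the number of free parameters and N the
sample size. The Python sketch below applies them to the three-class row, assuming
that a model with K classes and J binary indicators has K*J + (K - 1) free
parameters (the usual count for an unrestricted latent class model; the paper does
not state it), with J = 20 indicators and N = 1,073 schools taken from the text. The
function names are illustrative.

```python
import math


def n_free_parameters(n_classes, n_items):
    """One response probability per item and class, plus the class sizes,
    which sum to one (assumed parameterization)."""
    return n_classes * n_items + (n_classes - 1)


def aic(log_lik, n_params):
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_obs):
    return -2.0 * log_lik + n_params * math.log(n_obs)


# Three-class model: log-likelihood as printed in Table 2.
params_3 = n_free_parameters(3, 20)
aic_3 = aic(-8246.6, params_3)
bic_3 = bic(-8246.6, params_3, 1073)

# BIC by number of classes, copied from Table 2 (two-class value read as 17,055.3).
reported_bic = {2: 17055.3, 3: 16925.6, 4: 16988.7, 5: 17069.9}
best_k = min(reported_bic, key=reported_bic.get)
```

Under these assumptions the computed AIC and BIC for three classes fall within
rounding distance of the printed values, and the smallest reported BIC is the one
for the three-class model.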
Figure 2 reports the average proportion by group of school principals for each of
the 20 indicators (reported on the X axis). The vertical dotted line separates the
indicators concerning the frequency of application of leadership and managerial
practices (on the left) from the indicators about the leadership role (on the right). We
named the three groups identified Adult developers (49% of the total),
Instructional leaders (35% of the total) and Transformational leaders (16% of the
total). Adult developers represent nearly half of the total sample and show a
particularly high focus on supporting teachers’ development and training and a lower
level of active intervention in classroom activities (Drago-Severson, 2009). In fact,
they show low levels of presence in the classroom, such as substituting
teachers, supervising educational activities or dealing with disruptive behaviours among
students, so that they can be considered particularly inclined towards adult
leadership. Instructional leaders represent about one third of the total sample (35%) and
report, on average, a high level of application of the practices listed, all of which
concern instructional leadership. They are able to cover all aspects of the educational
practices taking place within the school, with the possible risk of bringing their role too
close to an operational one. Finally, transformational leaders (16%) are so labelled
given their strong orientation towards informing teachers about training opportunities and the
importance they attach to the continuous improvement of teachers’ skills. These are two pillars of
transformational leadership in terms of the ability to increase teachers’ engagement,
skills and ability (Leithwood & Jantzi, 2000, Marks & Printy, 2003, Robinson et al.,
2008).


Figure 2. Statistical indicator plots of the groups of three leadership styles.

[Figure: average proportion by group (Y axis) for each of the 20 indicators (X axis); legend
entry recovered: Adult developers (49.0%).]


Note: indicators are reported on the X axis. The vertical dotted line divides questions 
about the frequency of use of leadership practices (on the left), from opinions about 
the principal’s leadership role (on the right). Adult developers, N=526; instructional 
leaders, N=375; transformational leaders, N=172. 
Table 3 reports the results for the covariates used to characterise the groups. Adult
developers are kept as the reference group, as it is the largest one, so odds ratios are
reported for significant measures with reference to it. In detail, instructional leaders
are 1.92 times more likely to be heads of schools located in Central Italy
(p-value<0.05), and 4.55 times more likely to be in Southern Italy (p-value<0.01).
Moreover, instructional leaders are much more likely (29 times, p-value<0.01) to
manage private schools, though the extremely small number of schools of this kind
should be noted (they represent 3% of the overall sample, so the estimates may be
imprecise). With reference to the individual characteristics of school
principals, instructional leaders are more likely to be women and older than adult
developers (p-value<0.01), who in turn are more likely to have these characteristics
than transformational leaders (p-value<0.01). Moreover, instructional leaders are
less likely than adult developers to have a contract of regency, under which principals
are in charge of more than one school (p-value<0.05). The fact that they
do not have to manage different schools may give them more opportunities to intervene
actively in classroom activities. In turn, transformational leaders are more likely than
adult developers to manage more than one school (p-value<0.05).
Finally, instructional leaders are also less likely to have been appointed for less than two
years, a time span that indicates whether the principal was already managing the school
when the cohort of students analysed (who attend grade 8) entered lower
secondary school, two years before.

Finally, Table 4 reports the school average test scores in mathematics and reading per
group, the distal outcomes employed in the analysis. Results show that students in
schools run by instructional leaders report a significantly lower average
score than students in the other two groups, in both of the subjects tested.
The average school score tends to be higher in reading than in mathematics, with
overall means of 61.3 and 54.5 respectively. Instructional leaders, however, report an
average school score of 59.6 in reading and 52.8 in mathematics.


Table 3. Means and Odds Ratios for Covariates.


                                      Adult developers    Instructional leaders  Transformational leaders
School principals' characteristics    Mean    Odds ratio  Mean    Odds ratio     Mean    Odds ratio
Average SES index                     0.016   -           -0.003                 0.013
Context: school in Central Regions    0.19    -           0.21    1.92**         0.20
Context: school in Southern Regions   0.33    -           0.54    4.55***        0.25
Private school                        0.01    -           0.07    29.96***       0.01
Age (years)                           55      -           57      1.07***        54      0.96*
Gender (female SP = 1)                0.66    -           0.74    2.27***        0.51    0.44***
Education (PhD = 1)                   0.04    -           0.02                   0.03
Experience as SP (years)              9.0     -           10.2                   9.2
Temporary contract                    0.03    -           0.04                   0.02
Contract of regency                   0.08    -           0.04    0.38**         0.12    1.97*
In the school for less than 2 years   0.41    -           0.31    0.59**         0.38


Note: Significance tests are logistic regressions. *p<.10. ** p<.05. *** p<.01.

Table 4. Means and p-values for distal outcomes (grade 8).

                                                  Adult            Instructional   Transformational   p-value       p-value       p-value
                                                  developers (1)   leaders (2)     leaders (3)        1vs2          2vs3          1vs3
School principals' characteristics                Mean             Mean            Mean
School average mathematics test score - grade 8   55.52            52.81           54.86              [illegible]   [illegible]   0.508
School average reading test score - grade 8       [illegible]      59.61           62.60              0.020         0.005         0.612

Note: Significance tests are Pearson chi-square.


4. Discussion and concluding remarks 


This study investigates the existence of different leadership types across Italian
lower secondary schools. Each subgroup of school leaders is characterized according
to individual and contextual characteristics, and the statistical difference across
groups in student achievement is then examined. Applying Latent Class Analysis (LCA),
we define three subgroups of school leaders, namely adult developers (49%),
instructional leaders (35%) and transformational leaders (16%). Groups differ in
terms of principals' individual characteristics (age, gender, type of contract) and
institutional/contextual factors (public/private ownership and geographical
location). Finally, we observe a statistically significant difference across groups
in student achievement, with instructional leaders running schools with lower test
scores. In interpreting these results, we are cautious about the direction of
causality (if any) between school principals' leadership styles and school average
test scores. In terms of policy implications, the results suggest a direction for the
evaluation of school principals. Indicators used in this process should encourage
those activities that are more likely to be associated with better school academic
results, for instance teachers' training and development. On the other hand,
principals should be less involved in operational activities that could reduce their
effectiveness in leading the whole organisation. Future research should aim at
finding patterns of leadership types within specific approaches to managerial
practices (in terms of areas of management, see Bloom et al., 2015; Di Liberto et
al., 2015). This would allow a better investigation of the relationship between
leadership styles and the managerial practices implemented. Moreover, additional
years of data would make it possible to analyse whether the effects of leadership are
stable over time, in the light of the recent policy changes.

Poverty measures to analyse the educational
inequality in the OECD Countries


Misure di povertà per l’analisi delle diseguaglianze 
educative nei paesi OECD 


Tommaso Agasisti, Sergio Longobardi and Felice Russo 


Abstract This paper studies the degree of educational poverty in OECD countries
on the basis of the latest edition (2015) of the OECD Programme for International
Student Assessment (PISA). In terms of PISA data, the ‘poor in education’ are the
students below the baseline level of proficiency that is required to participate
fully in society. We adopt both a one-dimensional and a multidimensional approach
to measuring poverty in education: educational poverty is analysed with the poverty
metrics developed by Foster, Greer and Thorbecke and with those proposed by Alkire
and Foster. The main results of our analysis provide a detailed picture of the
degree of poverty relative to student learning in OECD countries, and they can be
considered an analytical tool to improve the quality of educational systems.


Sommario Il lavoro analizza il grado di povertà educativa nei paesi OECD sulla
base dell’ultima edizione (2015) del Programme for International Student 
Assessment (PISA) dell’OECD. La definizione di povero in ambito educativo fa 
riferimento, in termini di dati PISA, agli studenti al di sotto della soglia di 
rendimento richiesta per avere una partecipazione attiva nella società. Per misurare 
la povertà educativa viene adottato sia un approccio unidimensionale che 
multidimensionale. In questa ottica, si ricorre sia agli indicatori sviluppati da
Foster, Greer e Thorbecke sia a quelli proposti da Alkire e Foster. I principali
risultati offrono un quadro dettagliato del livello di povertà educativa nei Paesi 
OECD e costituiscono uno strumento analitico per migliorare il livello qualitativo 
dei sistemi educativi. 

1. Introduction


Eradicating or alleviating educational poverty is widely considered a highly
relevant goal, and it has recently captured the attention of governments and
international institutions. The main reason for this attention is that poor
performance at school has a negative impact on the future educational and
socio-economic attainment of students (Erickson et al., 2005) and long-term
consequences for society as a whole (OECD, 2016).

In this light, the main idea of this paper is to build on the well-developed
techniques used in economic studies of poverty and to adapt them to some features
of educational data. The main question that arises is how to quantify the extent of
poverty. We argue that a poverty analysis based on a single educational attribute
cannot provide an exhaustive description of learning deprivation as a whole.
Consequently, our study is not limited to a single-attribute approach, and we
propose both a one-dimensional and a multidimensional analysis. We use the data on
students' performance in the three main domains investigated by the OECD PISA
(Programme for International Student Assessment): reading, mathematics and science.
These scores have statistical properties similar to those of income data, e.g.,
both are individual and continuous observations. Moreover, both variables are
important predictors of individual and collective well-being. The one-dimensional
analysis of poverty in education is performed by estimating three educational
deprivation indices developed by Foster, Greer and Thorbecke (1984). The first index
(educational deprivation headcount) is the proportion of the student population
whose learning is below the educational poverty line. The second (educational
deprivation gap index) considers the student's gap from the educational deprivation
line. The third (educational deprivation severity index) gives greater weight to the
very poor than to the less poor, thus also taking into account inequality among the
poor.

Focusing on the multidimensional aspects of educational poverty, we provide an
application of the additive multidimensional poverty index proposed by Alkire and
Foster (2011). The rest of the study is structured as follows. Section 2 presents the
OECD data and the methodology, proposing different tools to describe the extent of
and the changes in poverty. Section 3 discusses the empirical evidence.


2. Data and methodology 


The analysis of educational poverty and deprivation draws upon the OECD PISA
data. The aim of PISA is to collect highly standardised data that can be used to
compare the competencies of representative samples of 15-year-old students in the
three main domains of reading, mathematics and science, both within and between
countries. Since the first cycle in 2000, PISA has taken place every 3 years with a
growing number of participating countries, and each cycle looks in depth at one
major domain.

We use the results of the PISA 2006 and 2015 editions for all 35 OECD countries
and compute some poverty metrics focusing on the test scores in science, as it was
the major domain in both PISA rounds.

From a methodological point of view, the analysis of poverty consists of two
steps (Sen, 1976): first, the identification of the poor by defining a threshold
(poverty line); second, the aggregation of the poor. According to OECD (2010), the
threshold is given by the proficiency score corresponding to the lowest limit of
level 2 on an ordered scale of proficiency levels that goes from 1 (lowest-skilled
students) to 6 (highest-skilled students)¹.

In a one-dimensional context, at a given point in time, a poverty statistic P is a
function of the learning distribution X and of the poverty line Z. The
Foster-Greer-Thorbecke (FGT) indices are the most widely used poverty measures (see
Foster et al., 1984). They can be defined as:

$$P_\alpha(X; Z) = \frac{1}{N} \sum_{1 \le i \le q} \left( \frac{Z - x_i}{Z} \right)^{\alpha}$$

where $\alpha$ is a non-negative parameter, N is the total population, q is the
number of units with learning less than Z, and $x_i$ is the standardized test result
of the unit of observation i, for i = 1, 2, ..., N.

For $\alpha=0$, we obtain the educational deprivation headcount. For $\alpha=1$, we
get the educational deprivation gap index. Finally, for $\alpha=2$, we obtain the
educational deprivation severity index. Using all three measures gives a fuller view
of educational poverty, as they reflect different aspects of it: incidence, depth and
severity, respectively. The magnitude and the direction of their changes might differ.
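
The three indices can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `fgt_index`, the optional `weights` argument and the toy scores are assumptions made for illustration, and it ignores the plausible values and replicate weights that a full PISA analysis would use. The poverty line of 410 is the science threshold quoted in the footnote of this section.

```python
import numpy as np

def fgt_index(scores, z, alpha, weights=None):
    """Foster-Greer-Thorbecke index P_alpha for test scores and poverty line z."""
    x = np.asarray(scores, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    poor = x < z                               # identification step
    gap = np.where(poor, (z - x) / z, 0.0)     # normalised shortfall, 0 for the non-poor
    # For alpha = 0 the index counts the poor, so the indicator is used directly
    contrib = poor.astype(float) if alpha == 0 else gap ** alpha
    return float(np.sum(w * contrib) / np.sum(w))

# Toy science scores (hypothetical), poverty line at the lowest limit of level 2
scores = np.array([350.0, 400.0, 409.0, 410.0, 480.0, 530.0, 600.0, 390.0])
z = 410.0
p0, p1, p2 = (fgt_index(scores, z, a) for a in (0, 1, 2))
print(f"headcount={p0:.3f}  gap={p1:.4f}  severity={p2:.5f}")
```

Because each normalised gap lies between 0 and 1, raising it to a higher power shrinks it, so the three indices are ordered: headcount, then gap, then severity. Their percentage changes over time, however, need not have the same sign.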

In a multidimensional analysis with at least two dimensions, a student deprived in
both attributes should be considered poor without ambiguity. However, unlike in the
one-dimensional case, how should a student deprived in only one learning attribute
be classified? The literature suggests two extreme approaches, the «intersection»
method and the «union» method: the former identifies observation i as poor if he/she
is poor in both attributes, while under the latter the student is considered poor if
his/her learning outcome is below the poverty cut-off in at least one attribute. In
this study, we present results according to the latter method. In a multidimensional
context, Alkire and Foster (2011) advocate a second cut-off, k, given by the number
of dimensions in which the individual has to be deprived in order to be considered
globally poor. Denoting by $c_i$ the number of educational deprivations suffered by
observation i, he/she is judged educationally globally poor if $c_i \geq k$. With
this dual cut-off, Alkire and Foster propose the following $M_\alpha$ class of
measures:
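
In the standard Alkire-Foster notation, this class can be sketched as follows. The symbols $x_{ij}$ (score of student i in dimension j), $z_j$ (poverty cut-off of dimension j) and $g_{ij}^{\alpha}(k)$ (censored normalised gap) are introduced here for illustration and are assumptions about notation, not necessarily the authors' own symbols:

$$M_\alpha(X; Z) = \frac{1}{N d} \sum_{i=1}^{N} \sum_{j=1}^{d} w_j \, g_{ij}^{\alpha}(k),
\qquad
g_{ij}^{\alpha}(k) =
\begin{cases}
\left( \dfrac{z_j - x_{ij}}{z_j} \right)^{\alpha} & \text{if } x_{ij} < z_j \text{ and } c_i \geq k, \\
0 & \text{otherwise.}
\end{cases}$$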


¹ Thus, in what follows we use absolute poverty thresholds. Proficiency level 2 is the baseline level
of proficiency that is required to participate fully in society. The lowest limit of level 2 corresponds to
407 points for Reading, 420 for Mathematics and 410 for Science.
The weight assigned to dimension j is $w_j$, such that $\sum_{j=1}^{d} w_j = d$.
The parameter $\alpha$ is still a non-negative poverty aversion parameter, specific
for each dimension. The attributes considered here are independent (they are neither
substitutes nor complements), and the aggregation procedure is not sensitive to the
kind of interrelationship between educational deprivation dimensions. Once the
identification step has been accomplished, the AF index can be decomposed into the
contribution of each attribute.
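
The dual cut-off identification and the decomposition by attribute can be sketched in Python for the case $\alpha=0$. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `adjusted_headcount`, the toy scores and the default unit weights are hypothetical, while the cut-offs are the level 2 thresholds quoted in the footnote above.

```python
import numpy as np

def adjusted_headcount(scores, cutoffs, k=1, weights=None):
    """Return (M0, percentage contribution of each dimension) for the dual cut-off method."""
    x = np.asarray(scores, dtype=float)              # shape (N, d)
    z = np.asarray(cutoffs, dtype=float)             # shape (d,)
    n, d = x.shape
    w = np.ones(d) if weights is None else np.asarray(weights, dtype=float)
    deprived = (x < z).astype(float)                 # 1 where student i is deprived in dimension j
    c = deprived @ w                                 # weighted deprivation count c_i
    poor = c >= k                                    # second cut-off: globally poor students
    censored = deprived * poor[:, None]              # deprivations of the non-poor are ignored
    per_dim = (censored * w).sum(axis=0) / (n * d)   # part of M0 due to each dimension
    m0 = float(per_dim.sum())
    shares = 100.0 * per_dim / m0 if m0 > 0 else np.zeros(d)
    return m0, shares

cutoffs = [407.0, 420.0, 410.0]                      # reading, mathematics, science
scores = np.array([
    [400.0, 430.0, 450.0],                           # deprived in reading only
    [420.0, 400.0, 450.0],                           # deprived in mathematics only
    [500.0, 500.0, 500.0],                           # not deprived
    [380.0, 390.0, 400.0],                           # deprived in all three domains
])
m0_union, shares_union = adjusted_headcount(scores, cutoffs, k=1)   # union method
m0_k2, shares_k2 = adjusted_headcount(scores, cutoffs, k=2)         # stricter cut-off
print(m0_union, shares_union, m0_k2)
```

With k = 1 the sketch reproduces the union method used in this study; raising k removes from the count the deprivations of students who are deprived in fewer than k domains.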


3. Main results 


First, Table 1 reports the FGT educational poverty measures computed on PISA
scores in science. Looking at the proportion of students who are poor in education,
we notice a reduction in the incidence of poverty for 13 countries. This is true in
particular for countries with the worst educational poverty outcomes in 2006 (e.g.
Mexico, Turkey, Italy, USA, Portugal). For the same countries, this trend is also
confirmed for $\alpha=1$ and $\alpha=2$. More interestingly, the extent of poverty
declines more rapidly when it is measured by the educational deprivation gap index
rather than by the educational deprivation headcount: this signals that the benefits
of poverty reduction accruing to the less poor are lower than those accruing to the
very poor. The same conclusion holds, with the exception of Poland, when we compare
the outcomes of the educational deprivation gap index with those of the educational
deprivation severity index: the latter measure indicates a clearer decline in
poverty. For seven countries, the estimated variations have contradictory signs
depending on the value of the parameter $\alpha$. In contrast with the increase in
the incidence and depth of educational poverty, the severity of educational poverty
decreases in Canada, Switzerland, Germany, France and Luxembourg. Great Britain and
Ireland follow a similar path: educational poverty in science rises between 2006 and
2015 only with $\alpha=0$. In those cases, the distance between poorly performing
students and the educational poverty line narrowed over time, while the educational
deprivation headcount increased from 2006 to 2015. In those countries, evidently,
the choice of poverty measure does matter. The direction of change is fully reversed
in the other countries that complete our PISA OECD database: the proportionate
changes in all our poverty measures are positive over time.

In particular, we observe a sharp rise in educational poverty in Finland, Hungary,
the Netherlands, Sweden and Slovakia. For those countries, all the deprivation
measures increase by about one third or more, and the increase in educational
poverty grows with $\alpha$.


Table 1: Index of Foster-Greer-Thorbecke (FGT) for different values of α, PISA scores in Science.

         Educational deprivation    Educational deprivation    Educational deprivation
         headcount (α=0)            gap (α=1)                  severity index (α=2)
Country  2006   2015      var.%    2006   2015      var.%    2006   2015      var.%
AUS      0.127  0.175    38.14%    0.016  0.023    42.00%    0.004  0.005    32.58%
AUT      0.164  0.209    27.43%    0.020  0.026    27.73%    0.004  0.005    19.21%
BEL      0.171  0.200    16.41%    0.023  0.026    13.17%    0.005  0.005     0.73%
CAN      0.099  0.113    14.90%    0.011  0.012     8.90%    0.002  0.002    -0.45%
CHE      0.159  0.188    18.21%    0.021  0.023     6.55%    0.005  0.005    -7.65%
CHL      0.391  0.346   -11.58%    0.058  0.046   -20.72%    0.014  0.010   -29.07%
CZE      0.155  0.205    32.46%    0.019  0.024    30.63%    0.004  0.004    15.84%
DEU      0.160  0.169     5.94%    0.020  0.021     4.09%    0.004  0.004    -2.11%
DNK      0.180  0.158   -12.07%    0.021  0.019   -12.94%    0.004  0.004   -20.09%
ESP      0.197  0.185    -5.94%    0.024  0.021   -11.86%    0.005  0.004   -22.63%
EST      0.076  0.081     6.22%    0.007  0.008    20.72%    0.001  0.001    31.00%
FIN      0.039  0.116   198.79%    0.004  0.014   267.92%    0.001  0.003   341.67%
FRA      0.216  0.217     0.29%    0.031  0.031     0.10%    0.007  0.007    -4.32%
GBR      0.170  0.176     3.09%    0.024  0.020   -15.57%    0.006  0.004   -33.74%
GRC      0.245  0.333    35.98%    0.035  0.048    35.19%    0.009  0.011    22.89%
HUN      0.151  0.257    69.91%    0.016  0.036   120.73%    0.003  0.008   148.86%
IRL      0.156  0.157     0.80%    0.019  0.017   -10.74%    0.004  0.003   -22.70%
ISL      0.206  0.258    24.86%    0.028  0.033    17.98%    0.006  0.007     6.79%
ISR      0.359  0.310   -13.59%    0.063  0.050   -21.72%    0.017  0.012   -28.81%
ITA      0.254  0.233    -8.28%    0.035  0.030   -13.27%    0.008  0.006   -22.21%
JPN      0.120  0.095   -20.47%    0.016  0.011   -33.52%    0.004  0.002   -46.67%
KOR      0.113  0.141    25.03%    0.013  0.017    29.55%    0.003  0.004    34.46%
LUX      0.220  0.258    17.63%    0.031  0.033     6.52%    0.007  0.006    -8.09%
LVA      0.175  0.171    -2.25%    0.020  0.017   -13.06%    0.004  0.003   -28.53%
MEX      0.511  0.469    -8.28%    0.079  0.062   -21.11%    0.019  0.013   -31.85%
NLD      0.127  0.184    44.78%    0.014  0.022    57.62%    0.002  0.004    73.66%
NOR      0.218  0.186   -14.96%    0.029  0.023   -20.94%    0.007  0.005   -31.26%
NZL      0.134  0.175    30.49%    0.019  0.022    17.64%    0.004  0.004     6.18%
POL      0.173  0.163    -5.75%    0.019  0.017    -8.59%    0.003  0.003    -7.78%
PRT      0.243  0.179   -26.35%    0.030  0.020   -33.66%    0.006  0.004   -39.84%
SVK      0.203  0.305    50.15%    0.026  0.047    80.39%    0.006  0.011    98.43%
SVN      0.141  0.151     7.50%    0.015  0.017    11.02%    0.003  0.003    12.06%
SWE      0.163  0.216    32.42%    0.020  0.031    53.17%    0.004  0.007    64.88%
TUR      0.463  0.447    -3.50%    0.063  0.060    -4.91%    0.013  0.012    -6.86%
USA      0.247  0.198   -20.01%    0.035  0.025   -28.98%    0.008  0.005   -37.74%


Turning to the multidimensional analysis, Table 2 presents our findings for the AF
educational deprivation index when $\alpha=0$. This measure both summarises the
overall learning deprivation and can be decomposed into the contribution of each
attribute. We call this statistic the adjusted educational deprivation headcount,
i.e. the total number of dimensions in which the multidimensionally poor population
is deprived, divided by $N \times d$, the maximum total number of dimensions in
which the student population can be deprived. All learning dimensions (reading,
mathematics and science) are equally weighted, $w_j = 1$ for any j. Mathematics
emerges as the learning dimension in which educational poverty is highest in most
countries, while reading is the attribute that contributes most to learning
deprivation in only five countries. The weights of the attributes in overall
educational poverty are quite similar for nine countries. Only in four countries is
the contribution of a single learning attribute higher than 40%: three times for
mathematics and once for reading.

Table 2: Multidimensional index of Alkire-Foster, PISA 2015 scores.

         Alkire-Foster Index     AF Index Decomposition (%)
Country  Estimate  Std. err.     Reading  Mathematics  Science
AUS      0.194     0.003         31.57    38.08        30.35
AUT      0.217     0.005         34.05    33.82        32.13
BEL      0.197     0.004         32.83    33.24        33.93
CAN      0.119     0.003         29.12    38.94        31.94
CHE      0.182     0.005         36.97    28.56        34.47
CHL      0.375     0.006         25.05    43.98        30.97
CZE      0.215     0.005         34.30    33.67        32.03
DEU      0.165     0.004         31.47    34.19        34.34
DNK      0.147     0.004         34.68    29.32        36.00
ESP      0.190     0.004         27.61    39.68        32.71
EST      0.104     0.004         35.99    37.39        26.62
FIN      0.120     0.004         29.20    38.49        32.31
FRA      0.221     0.005         31.99    35.05        32.96
GBR      0.196     0.004         31.21    38.83        29.96
GRC      0.322     0.006         28.35    37.06        34.59
HUN      0.272     0.006         33.97    34.35        31.68
IRL      0.135     0.004         24.16    36.97        38.87
ISL      0.242     0.006         30.87    33.32        35.81
ISR      0.300     0.005         29.42    35.89        34.69
ITA      0.225     0.005         30.03    35.30        34.67
JPN      0.112     0.003         40.01    31.32        28.67
KOR      0.146     0.004         32.41    35.18        32.41
LUX      0.257     0.005         33.48    32.84        33.68
LVA      0.186     0.005         30.56    38.78        30.66
MEX      0.486     0.006         28.79    38.89        32.32
NLD      0.176     0.005         33.79    31.26        34.95
NOR      0.167     0.004         28.58    34.15        37.27
NZL      0.189     0.005         30.65    38.10        31.25
POL      0.162     0.005         30.62    35.41        33.97
PRT      0.198     0.005         28.99    40.56        30.45
SVK      0.304     0.005         35.26    31.15        33.59
SVN      0.154     0.004         32.68    34.55        32.77
SWE      0.200     0.005         30.37    33.45        36.18
TUR      0.453     0.006         29.77    37.21        33.02
USA      0.227     0.005         28.45    42.43        29.12

Quasi-Maximum Likelihood Estimators For
Functional Spatial Autoregressive Models


Quasi-verosimiglianza stima per funzionale spaziale 
autoregressivo modello 


Mohamed-Salem Ahmed, Laurence Broze, Sophie Dabo-Niang, Zied Gharbi 


Abstract We propose a functional linear autoregressive spatial model where the
explanatory variable takes values in a function space while the response process is
real-valued and spatially autocorrelated. The specificity of the model lies in the
functional nature of the explanatory variable and in the structure of a spatial
weight matrix that defines the spatial relation and dependency between neighbors.
The estimation procedure consists in reducing the infinite dimension of the
functional explanatory variable and maximizing a quasi-likelihood. We establish both
consistency and asymptotic normality of the estimate of the regression parameter
function. We illustrate the performance of the method with some numerical results.

Abstract In questo lavoro si propone un modello lineare spaziale autoregressivo in
cui la variabile esplicativa prende valori in uno spazio di funzioni e la variabile
risposta è a valori reali spazialmente correlati. La specificità del modello risiede
nella natura funzionale della variabile esplicativa e nella struttura della matrice
di prossimità i cui elementi definiscono la relazione spaziale e la dipendenza tra i
vicini. La procedura di stima consiste nel ridurre la dimensione infinita della
variabile esplicativa funzionale e nel massimizzare una quasi-verosimiglianza.
Vengono stabiliti consistenza e normalità asintotica dello stimatore funzionale del
parametro di regressione, la cui performance viene illustrata attraverso uno studio
di simulazione.

Key words: Functional Linear Models, Spatial Autoregressive process,
Quasi-Maximum Likelihood Estimators, ...


1 Introduction 


This work concerns two research areas: spatial econometrics and functional data
analysis. Functional random variables are increasingly common in statistical
analyses, owing to the availability of high-frequency data and of new mathematical
strategies to deal with such statistical objects. The field is known as Functional
Data Analysis (FDA), and its applications are growing across disciplines. The
functional variables are mainly curves, surfaces or manifolds. For an introduction
to this field as well as illustrations and applications, see [21].

In many fields, such as urban systems, agriculture, environmental sciences and
economics, one often deals with spatially dependent data. Modelling spatial
dependency in statistical inference (estimation of a spatial distribution,
regression, prediction, ...) is therefore a significant feature of spatial data
analysis, and spatial statistics provides tools to solve such problems. Various
spatial models and methods have been proposed, particularly within the scope of
geostatistics. So far, most spatial modelling methods are parametric and concern
non-functional data. Several types of functional linear models for independent data
have been developed over the years, serving different purposes. The most studied is
perhaps the functional linear model for scalar response, originally introduced by
[14]. Estimation and prediction problems for this model and some of its
generalizations have been tackled mainly for independent data (see, e.g., [7], [17],
[6], [9]). Some works exist on functional spatial linear prediction using kriging
methods (see, e.g., [18], [11], [12], [15], [10]), which highlights the interest of
considering spatial linear functional models. We are interested in a functional
spatial autoregressive linear model. One of the best-known spatial models is the
Spatial Autoregressive (SAR) model of [5], which extends autoregression in time
series to spatial data. The structure of this model and its estimation have been
developed and summarized by many authors, such as [1], [8] and [4], among others.
More recently, [16] proposed the quasi-maximum likelihood estimator for the SAR
model for real-valued data and investigated its asymptotic properties under normal
distributional specifications. We extend this model to the case where the covariate
is a functional random variable. In the following, we present the functional SAR
(FSAR) model and its quasi-maximum likelihood (QML) estimator.


2 The model 


We consider that at n spatial units we observe a real random variable Y, considered
as the response variable, and a functional covariate $\{X(t), t \in \mathcal{T}\}$,
considered as the explanatory function and corresponding to a square integrable
stochastic process on the interval $\mathcal{T} \subset \mathbb{R}$. Assume that the
process $\{X(t), t \in \mathcal{T}\}$ takes values in some space
$\mathcal{X} \subset L^2(\mathcal{T})$, where $L^2(\mathcal{T})$ is the space of
square integrable functions on $\mathcal{T}$. The spatial dependency structure
between these n spatial units is described by a non-stochastic $n \times n$ spatial
weight matrix $W_n$ that depends on n. The elements $w_{ijn}$ of this matrix are
usually taken to be inversely proportional to the distance between spatial units i
and j with respect to some metric (physical distance, social network or economic
distance; see for instance [20]). Since the weight matrix changes with n, we
consider the observations as a triangular array. This is required to carry out an
asymptotic study of the following model, which describes the relationship between
the response variable Y and the covariate function $X(\cdot)$ [22]. We assume that
this relationship is modeled by the following Functional Spatial Autoregressive
(FSAR) model:

$$Y_i = \lambda_0 \sum_{j=1}^{n} w_{ijn} Y_j + \int_{\mathcal{T}} X_i(t)\,\beta^*(t)\,dt + U_i, \qquad i = 1, \dots, n,\ n = 1, 2, \dots \qquad (1)$$


where $\lambda_0$ (in a compact space $\Lambda$) is the autoregressive parameter,
$\beta^*(\cdot)$ is a parameter function assumed to belong to the space of functions
$L^2(\mathcal{T})$, and $(w_{ijn})_{j=1,\dots,n}$ is the i-th row of $W_n$. The
disturbances $\{U_i,\ i = 1, \dots, n,\ n = 1, 2, \dots\}$ are assumed to be
independent Gaussian random variables such that $E(U_i) = 0$ and
$E(U_i^2) = \sigma_0^2$. They are also independent of
$\{X_i(t),\ t \in \mathcal{T},\ i = 1, \dots, n,\ n = 1, 2, \dots\}$. We are
interested in estimating the unknown true parameters $\lambda_0$, $\beta^*(\cdot)$
and $\sigma_0^2$. Let $X_n(\beta^*(\cdot))$ be the $n \times 1$ vector with i-th
element $\int_{\mathcal{T}} X_i(t)\beta^*(t)\,dt$; then one can rewrite (1) as

$$S_n Y_n = X_n(\beta^*(\cdot)) + U_n, \qquad n = 1, 2, \dots \qquad (2)$$


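with $S_n = (I_n - \lambda_0 W_n)$, where $Y_n$ and $U_n$ are the $n \times 1$
vectors with elements $Y_i$ and $U_i$, $i = 1, \dots, n$, respectively, and $I_n$
denotes the $n \times n$ identity matrix.

Let $S_n(\lambda) = I_n - \lambda W_n$ and
$V_n(\lambda, \beta(\cdot)) = S_n(\lambda) Y_n - X_n(\beta(\cdot))$, so that the
conditional log-likelihood function of the vector $Y_n$, given
$\{X_i(t),\ t \in \mathcal{T},\ i = 1, \dots, n,\ n = 1, 2, \dots\}$, is given by:

$$L_n(\lambda, \beta(\cdot), \sigma^2) = -\frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi + \ln|S_n(\lambda)| - \frac{1}{2\sigma^2}\, V_n'(\lambda, \beta(\cdot))\, V_n(\lambda, \beta(\cdot)) \qquad (3)$$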


The maximum likelihood estimates of $\lambda_0$, $\beta^*(\cdot)$ and $\sigma_0^2$
are the values of $\lambda$, $\beta(\cdot)$ and $\sigma^2$ that maximise (3).
However, this likelihood cannot be maximized without addressing the difficulty
produced by the infinite dimensionality of the explanatory random function. To deal
with this problem, we project, as usual, the functional explanatory variable and the
parameter function onto a space of functions generated by a basis whose dimension
increases as the sample size tends to infinity. Several truncation techniques exist:
[2] proposed to use the estimated eigenbasis of the sample, [3] restricted attention
to a spline basis, adding a penalty that controls the degree of smoothness of the
parameter function, and [19] proposed to use any basis of functions that satisfies
some truncation criterion. We adapt the alternative proposed by [19] in order to
resolve the infinite-dimensional problem of the functional space. This method will
be called the Truncated Conditional Likelihood Method.


3 Truncated Conditional Likelihood Method 


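Let $\{\varphi_j,\ j = 1, 2, \dots\}$ be an orthonormal basis of the functional
space $L^2(\mathcal{T})$, usually a Fourier or a spline basis, or a basis
constructed from the eigenfunctions of the covariance operator $\Gamma$, where this
operator is defined by:

$$\Gamma x(t) = \int_{\mathcal{T}} E(X(t)X(s))\, x(s)\, ds, \qquad x \in \mathcal{X},\ t \in \mathcal{T}. \qquad (4)$$

The operator $\Gamma$ is a linear integral operator whose integral kernel is

$$K(t, v) = E(X(t)X(v)), \qquad \text{for all } t, v \in \mathcal{T}. \qquad (5)$$

It is a compact self-adjoint Hilbert-Schmidt operator because

$$\int_{\mathcal{T}} \int_{\mathcal{T}} |K(t, v)|^2\, dt\, dv \le \left( \int_{\mathcal{T}} E(X^2(t))\, dt \right)^2 < \infty;$$

thus, it can be diagonalized.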


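We can rewrite $X(\cdot)$ and $\beta^*(\cdot)$ in the following way:

$$X(t) = \sum_{j \ge 1} \varepsilon_j \varphi_j(t) \quad \text{and} \quad \beta^*(t) = \sum_{j \ge 1} \beta_j^* \varphi_j(t) \qquad \text{for all } t \in \mathcal{T},$$

where the real random variables $\varepsilon_j$ and the coefficients $\beta_j^*$
are given by

$$\varepsilon_j = \int_{\mathcal{T}} X(t)\varphi_j(t)\, dt \quad \text{and} \quad \beta_j^* = \int_{\mathcal{T}} \beta^*(t)\varphi_j(t)\, dt.$$

Let $p_n$ be a positive sequence of integers that increases as $n \to \infty$. By
the orthonormality of the basis $\{\varphi_j,\ j = 1, 2, \dots\}$, we can consider
the following decomposition:

$$\int_{\mathcal{T}} X(t)\beta^*(t)\, dt = \sum_{j=1}^{\infty} \beta_j^* \varepsilon_j = \sum_{j=1}^{p_n} \beta_j^* \varepsilon_j + \sum_{j=p_n+1}^{\infty} \beta_j^* \varepsilon_j. \qquad (6)$$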


The truncation strategy introduced by [19]¹ consists of approximating the left-hand
side of (6) by the first term of the right-hand side only. This is possible when the
approximation error vanishes asymptotically, where this error is controlled by the
expected square of the second term on the right-hand side of (6). In particular, the
approximation error vanishes asymptotically when one considers the eigenbasis of the
variance-covariance operator, since

$$E\left( \sum_{j=p_n+1}^{\infty} \beta_j^* \varepsilon_j \right)^2 = \sum_{j=p_n+1}^{\infty} \beta_j^{*2} E(\varepsilon_j^2) = \sum_{j=p_n+1}^{\infty} \beta_j^{*2} \lambda_j,$$

where $\lambda_j$, $j = 1, 2, \dots$ are the eigenvalues. Under the truncation
strategy, $X_n(\beta^*(\cdot))$ will be approximated by $\xi_{p_n}\beta^*$, where
$\beta^* = (\beta_1^*, \dots, \beta_{p_n}^*)'$ and $\xi_{p_n}$ is an $n \times p_n$
matrix with (i, j)-th element given by

$$\varepsilon_j^{(i)} = \int_{\mathcal{T}} X_i(t)\varphi_j(t)\, dt, \qquad i = 1, \dots, n,\ j = 1, \dots, p_n.$$

¹ Note that, with $\lambda_0$ replaced by zero, our model can be viewed as a particular case of the
generalized functional linear models of [19] with the identity link function.
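
The construction of the score matrix can be sketched numerically. The Python sketch below is only an illustration under stated assumptions: the curves are taken to be observed on a common grid of [0, 1], the basis is the orthonormal Fourier basis, the integrals are approximated by the trapezoidal rule, and the function names `fourier_basis` and `score_matrix` are hypothetical.

```python
import numpy as np

def fourier_basis(t, p):
    """First p functions of the orthonormal Fourier basis of L^2([0, 1]) on the grid t."""
    phi = np.empty((p, t.size))
    for j in range(p):
        if j == 0:
            phi[j] = 1.0
        elif j % 2 == 1:
            phi[j] = np.sqrt(2.0) * np.sin(2 * np.pi * ((j + 1) // 2) * t)
        else:
            phi[j] = np.sqrt(2.0) * np.cos(2 * np.pi * (j // 2) * t)
    return phi

def score_matrix(curves, t, p):
    """n x p matrix whose (i, j) entry approximates the integral of X_i(t) * phi_j(t)."""
    phi = fourier_basis(t, p)
    dt = np.diff(t)
    wts = np.zeros_like(t)            # trapezoidal quadrature weights on the grid
    wts[:-1] += dt / 2
    wts[1:] += dt / 2
    return (curves * wts) @ phi.T

# Two toy curves built from known coefficients on the first three basis functions
t = np.linspace(0.0, 1.0, 201)
coef = np.array([[1.0, 0.5, -0.3],
                 [0.0, 2.0, 1.0]])
curves = coef @ fourier_basis(t, 3)
xi = score_matrix(curves, t, 3)       # should recover coef up to quadrature error
print(np.round(xi, 6))
```

In practice the truncation level (here fixed at 3) plays the role of $p_n$ and would be chosen to grow with the sample size.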


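Now the truncated conditional log-likelihood function can be obtained by replacing
$X_n(\beta(\cdot))$ in (3) by $\xi_{p_n}\beta$, for all
$\beta(\cdot) \in L^2(\mathcal{T})$ and $\beta \in \mathbb{R}^{p_n}$. The
corresponding feasible conditional log-likelihood is

$$L_n(\lambda, \beta, \sigma^2) = -\frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi + \ln|S_n(\lambda)| - \frac{1}{2\sigma^2}\, V_n'(\lambda, \beta)\, V_n(\lambda, \beta) \qquad (7)$$

with $V_n(\lambda, \beta) = S_n(\lambda)Y_n - \xi_{p_n}\beta$. For a fixed
$\lambda$, (7) is maximized at

$$\hat\beta_n(\lambda) = (\xi_{p_n}'\xi_{p_n})^{-1}\xi_{p_n}' S_n(\lambda) Y_n \qquad (8)$$

and

$$\hat\sigma_n^2(\lambda) = \frac{1}{n}\left(S_n(\lambda)Y_n - \xi_{p_n}\hat\beta_n(\lambda)\right)'\left(S_n(\lambda)Y_n - \xi_{p_n}\hat\beta_n(\lambda)\right) = \frac{1}{n}\, Y_n' S_n'(\lambda) M_n S_n(\lambda) Y_n, \qquad (9)$$

where $M_n = I_n - \xi_{p_n}(\xi_{p_n}'\xi_{p_n})^{-1}\xi_{p_n}'$ and $A'$ denotes
the transpose of a matrix $A$.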
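The concentrated truncated conditional log-likelihood function of $\lambda$ is:

$$L_n(\lambda) = -\frac{n}{2}\left(\ln(2\pi) + 1\right) - \frac{n}{2}\ln\hat\sigma_n^2(\lambda) + \ln|S_n(\lambda)|. \qquad (10)$$

Then the estimator of $\lambda_0$ is the value $\hat\lambda_n$ that maximizes
$L_n(\lambda)$, and $\hat\beta_n(\hat\lambda_n)$, $\hat\sigma_n^2(\hat\lambda_n)$
are the estimators of $\beta^*$ and $\sigma_0^2$ respectively, given by:

$$\hat\beta_n(\hat\lambda_n) = (\xi_{p_n}'\xi_{p_n})^{-1}\xi_{p_n}' S_n(\hat\lambda_n) Y_n$$

and

$$\hat\sigma_n^2(\hat\lambda_n) = \frac{1}{n}\, Y_n' S_n'(\hat\lambda_n) M_n S_n(\hat\lambda_n) Y_n.$$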
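
The estimation steps (8), (9) and (10) can be sketched in Python on simulated data. This is a minimal illustration and not the authors' implementation: the ring-shaped weight matrix, the grid search over $\lambda$, the simulated parameter values and the function names are all assumptions, and the score matrix is drawn directly at random instead of being computed from curves.

```python
import numpy as np

def ring_weights(n):
    """Row-normalised weight matrix: each unit is linked to its two ring neighbours."""
    w = np.zeros((n, n))
    idx = np.arange(n)
    w[idx, (idx - 1) % n] = 0.5
    w[idx, (idx + 1) % n] = 0.5
    return w

def concentrated_loglik(lam, y, w, xi):
    """Value of (10) at lam, with beta_hat(lam) from (8) and sigma2_hat(lam) from (9)."""
    n = y.size
    s = np.eye(n) - lam * w                          # S_n(lambda)
    sy = s @ y
    beta, *_ = np.linalg.lstsq(xi, sy, rcond=None)   # least squares solution, equation (8)
    resid = sy - xi @ beta
    sigma2 = float(resid @ resid) / n                # equation (9)
    sign, logdet = np.linalg.slogdet(s)              # ln|S_n(lambda)|
    if sign <= 0:
        return -np.inf, beta, sigma2
    value = -0.5 * n * (np.log(2 * np.pi) + 1) - 0.5 * n * np.log(sigma2) + logdet
    return value, beta, sigma2

def fit_fsar(y, w, xi, grid=None):
    """Maximise (10) over a grid of lambda values (a crude stand-in for an optimiser)."""
    grid = np.linspace(-0.9, 0.9, 91) if grid is None else np.asarray(grid)
    values = [concentrated_loglik(lam, y, w, xi)[0] for lam in grid]
    lam_hat = float(grid[int(np.argmax(values))])
    _, beta_hat, sigma2_hat = concentrated_loglik(lam_hat, y, w, xi)
    return lam_hat, beta_hat, sigma2_hat

# Simulated example with hypothetical parameter values
rng = np.random.default_rng(0)
n, lam0, sigma0 = 300, 0.4, 0.5
beta_star = np.array([1.0, -0.5, 0.25])
w = ring_weights(n)
xi = rng.normal(size=(n, beta_star.size))
u = sigma0 * rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - lam0 * w, xi @ beta_star + u)   # reduced form of (2)
lam_hat, beta_hat, sigma2_hat = fit_fsar(y, w, xi)
print(lam_hat, beta_hat, sigma2_hat)
```

Concentrating out $\beta$ and $\sigma^2$ reduces the problem to a one-dimensional search over $\lambda$, which is why a simple grid is enough for a sketch; a numerical optimiser would normally replace it.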
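To get identifiability of $\lambda_0$, $\beta^*$ and $\sigma_0^2$ in the truncated
model, notice that

$$E(L_n(\lambda, \beta, \sigma^2)) = -\frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi + \ln|S_n(\lambda)| - \frac{1}{2\sigma^2}\, E\left(V_n'(\lambda, \beta) V_n(\lambda, \beta)\right).$$

We have

$$\begin{aligned}
E\left(V_n'(\lambda, \beta) V_n(\lambda, \beta)\right)
&= E\left( (S_n(\lambda)Y_n - \xi_{p_n}\beta)'(S_n(\lambda)Y_n - \xi_{p_n}\beta) \right) \\
&= E\left( (S_n(\lambda)S_n^{-1}X_n(\beta^*(\cdot)) - \xi_{p_n}\beta)'(S_n(\lambda)S_n^{-1}X_n(\beta^*(\cdot)) - \xi_{p_n}\beta) \right) + \sigma_0^2 \operatorname{tr}(A_n(\lambda)) \\
&= E\left( (S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta)'(S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta) \right) + E\left(R_n' A_n(\lambda) R_n\right) \\
&\quad + \sigma_0^2 \operatorname{tr}(A_n(\lambda)) + 2E\left( (S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta)' R_n(\beta^*(\cdot)) \right)
\end{aligned}$$

where $A_n(\lambda) = S_n^{-1\prime} S_n'(\lambda) S_n(\lambda) S_n^{-1}$ and
$R_n(\beta^*(\cdot)) = S_n(\lambda)S_n^{-1}R_n$, with $R_n$ the vector with elements
$R_i = (X_n(\beta^*(\cdot)) - \xi_{p_n}\beta^*)_i = \sum_{s > p_n} \beta_s^* \varepsilon_s^{(i)}$.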


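The truncation strategy ensures that

$$E\left( (S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta)' R_n(\beta^*(\cdot)) \right) = o(1) \quad \text{and} \quad E\left(R_n' A_n(\lambda) R_n\right) = o(1).$$

Indeed, on the one hand,

$$E\left( (S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^*)' R_n(\beta^*(\cdot)) \right) = E\left( \beta^{*\prime}\xi_{p_n}' A_n(\lambda) R_n \right) = \operatorname{tr}(A_n(\lambda)) \sum_{r=1}^{p_n} \sum_{s > p_n} \beta_r^* \beta_s^* E(\varepsilon_r \varepsilon_s), \qquad (11)$$

and

$$E\left( (\xi_{p_n}\beta)' R_n(\beta^*(\cdot)) \right) = E\left( \beta'\xi_{p_n}' S_n(\lambda)S_n^{-1} R_n \right) = \operatorname{tr}(S_n(\lambda)S_n^{-1}) \sum_{r=1}^{p_n} \sum_{s > p_n} \beta_r \beta_s^* E(\varepsilon_r \varepsilon_s). \qquad (12)$$

The right-hand sides of (11) and (12) are zero when we consider the eigenbasis;
otherwise we need to assume that

$$p_n \operatorname{tr}(A_n(\lambda)) \sum_{s > p_n} |\beta_s^*| \sqrt{E(\varepsilon_s^2)} = o(1) \quad \text{and} \quad p_n \operatorname{tr}(S_n(\lambda)S_n^{-1}) \sum_{s > p_n} |\beta_s^*| \sqrt{E(\varepsilon_s^2)} = o(1). \qquad (13)$$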


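On the other hand, we have

$$E\left(R_n' A_n(\lambda) R_n\right) = \operatorname{tr}(A_n(\lambda)) \sum_{r > p_n} \sum_{s > p_n} \beta_r^* \beta_s^* E(\varepsilon_r \varepsilon_s). \qquad (14)$$

When using the eigenbasis, the term on the right-hand side of (14) is of order

$$\operatorname{tr}(A_n(\lambda)) \sum_{s > p_n} \beta_s^{*2} E(\varepsilon_s^2), \qquad (15)$$

otherwise it is of order

$$\operatorname{tr}(A_n(\lambda)) \left( \sum_{s > p_n} |\beta_s^*| \sqrt{E(\varepsilon_s^2)} \right)^2. \qquad (16)$$

In other words, the term (15) or (16) needs to be of order $o(1)$.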
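If this is the case, then

$$\begin{aligned}
E(L_n(\lambda, \beta, \sigma^2)) &= \ln|S_n(\lambda)| - \frac{1}{2\sigma^2}\, E\left( (S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta)'(S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta) \right) \\
&\quad - \frac{n}{2}\left(\ln\sigma^2 + \ln 2\pi\right) - \frac{\sigma_0^2}{2\sigma^2} \operatorname{tr}(A_n(\lambda)) + o(1).
\end{aligned}$$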


Let $I_n + \lambda_0 G_n = S_n^{-1}$ where $G_n = W_n S_n^{-1}$; therefore
$S_n(\lambda)S_n^{-1} = I_n + (\lambda_0 - \lambda)G_n$ for all $\lambda \in \Lambda$.
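
Both identities follow from $S_n S_n^{-1} = I_n$ and can be checked numerically. The following Python sketch does so on a small example; the ring-shaped weight matrix, the values of $\lambda_0$ and $\lambda$, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def ring_weights(n):
    """Row-normalised weights: each unit has its two ring neighbours, weight 1/2 each."""
    w = np.zeros((n, n))
    idx = np.arange(n)
    w[idx, (idx - 1) % n] = 0.5
    w[idx, (idx + 1) % n] = 0.5
    return w

n, lam0, lam = 6, 0.4, -0.2
w = ring_weights(n)
s_inv = np.linalg.inv(np.eye(n) - lam0 * w)      # S_n^{-1}
g = w @ s_inv                                    # G_n = W_n S_n^{-1}
lhs = (np.eye(n) - lam * w) @ s_inv              # S_n(lambda) S_n^{-1}
rhs = np.eye(n) + (lam0 - lam) * g               # I_n + (lambda_0 - lambda) G_n
identity_holds = bool(np.allclose(lhs, rhs))
print(identity_holds)
```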
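Now, for a fixed $\lambda$, $E(L_n(\lambda, \beta, \sigma^2))$ is maximized with
respect to $\beta$ and $\sigma^2$ at

$$\beta_n^*(\lambda) = \Gamma_{p_n}^{-1}\, \frac{1}{n} E\left( \xi_{p_n}' S_n(\lambda)S_n^{-1} \xi_{p_n} \right) \beta^*,$$

where $\Gamma_{p_n} = \frac{1}{n} E(\xi_{p_n}'\xi_{p_n})$ is symmetric and positive
definite in the case of an eigenbasis. In addition, we have

$$\begin{aligned}
\sigma_n^{*2}(\lambda) &= \frac{1}{n} E\left( (S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta_n^*(\lambda))'(S_n(\lambda)S_n^{-1}\xi_{p_n}\beta^* - \xi_{p_n}\beta_n^*(\lambda)) \right) + \frac{\sigma_0^2}{n} \operatorname{tr}(A_n(\lambda)) \\
&= \frac{(\lambda_0 - \lambda)^2}{n} E\left( (G_n\xi_{p_n}\beta^*)' \left( G_n\xi_{p_n}\beta^* - \xi_{p_n}\Gamma_{p_n}^{-1}\, \frac{1}{n} E(\xi_{p_n}' G_n \xi_{p_n}\beta^*) \right) \right) + \frac{\sigma_0^2}{n} \operatorname{tr}(A_n(\lambda)).
\end{aligned}$$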


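It is clear that $\beta_n^*(\lambda_0) = \beta^*$ and
$\sigma_n^{*2}(\lambda_0) = \sigma_0^2$. However, the identifiability of $\beta^*$
and $\sigma_0^2$ depends on that of $\lambda_0$. Let

$$Q_n(\lambda) = E\left( L_n(\lambda, \beta_n^*(\lambda), \sigma_n^{*2}(\lambda)) \right) = \ln|S_n(\lambda)| - \frac{n}{2}\ln\sigma_n^{*2}(\lambda) - \frac{n}{2}\left(1 + \ln 2\pi\right) + o(1). \qquad (17)$$

Therefore, proving the identifiability of $\lambda_0$ is equivalent to showing that
$\lambda_0$ maximizes $Q_n(\lambda)$.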
We provide the asymptotic properties of the proposed quasi-maximum likelihood estimators (QMLE)
of $\lambda_0$, $\beta^*$, and $\sigma_0^2$ under some basic conditions: some of them concern the error disturbance
and the weight matrix, others guarantee the equilibrium of the system (1) or deal
with the linearity of $\ln|S_n(\lambda)|$. Under these assumptions, we prove the convergence
in probability and the asymptotic normality of the estimators. We illustrate the performance
of the methods with some numerical results.
A clustering algorithm for multivariate big data
with correlated components 


Un algoritmo di clustering per big data multivariati con 
componenti correlate 


Giacomo Aletti and Alessandra Micheletti 


Abstract Common clustering algorithms require multiple scans of all the data to
achieve convergence, which is prohibitive when large databases, with millions of
records, must be processed. Algorithms that extend the popular K-means method
to the analysis of big data have been available in the literature since 1998 [1], but they
assume that the random vectors being processed and grouped have uncorrelated
components. Unfortunately, this is not the case in many practical situations. Here we
propose an extension of the algorithm of Bradley, Fayyad and Reina to the processing
of massive multivariate data with correlated components.

Abstract I comuni algoritmi di clustering richiedono di esaminare più volte tutti i
dati per raggiungere la convergenza, e ciò risulta proibitivo quando devono essere
analizzati database enormi, con milioni di dati. In letteratura sono presenti fin dal
1998 [1] alcuni algoritmi che estendono il popolare metodo K-medie all’analisi di
big data, ma essi assumono che i vettori aleatori che vengono analizzati e raggruppati
abbiano componenti non correlate. Purtroppo tale condizione non è soddisfatta
in molti casi pratici. Qui proponiamo un’estensione dell’algoritmo di Bradley,
Fayyad e Reina all’analisi di grandi moli di dati multivariati, con componenti correlate
fra loro.


Key words: big data, clustering, K-means, Mahalanobis distance 
1 Introduction 


Clustering is the division of a collection of data into groups, or clusters, such that
points in the same cluster are close to one another, while points in different clusters
are far apart. When the data are not very high dimensional but are too numerous to
fit in memory, either because they belong to a huge dataset or because they arrive in
streams and must be processed immediately or be lost, specific algorithms are needed
that analyze the data progressively, store in memory only a small number of summary
statistics, and then discard the processed data to free the memory. Situations like this,
in which clustering plays a fundamental role, recur in many applications, such as
customer segmentation on e-commerce web sites, image analysis of video frames for
object recognition, and recognition of human movements from data provided by
sensors placed on the body or on a smartphone. The key element of algorithms for this
type of big data is a method by which the summary statistics retained in memory can
be updated when each new observation, or group of observations, is processed. A
first and widely recognized method to cluster big data is the Bradley-Fayyad-Reina
(BFR) algorithm [1, 7], which is an extension of the classical K-means algorithm.
The BFR algorithm responds to the following data mining desiderata: 1) require
one scan of the database, and thus the ability to operate on a forward-only cursor;
2) on-line anytime behavior: a “best” answer is always available, with status information
on progress, expected remaining time, etc. provided; 3) suspendable, stoppable,
resumable: incremental progress can be saved in memory to resume a stopped job; 4)
ability to incrementally incorporate additional data into existing models efficiently;
5) work within the confines of a limited RAM buffer; 6) utilize a variety of possible
scan modes: sequential, index, and sampling scan, if available. The BFR algorithm
for clustering is based on the definition of three different sets of data: a) the retained
set (RS): the set of data points which are not recognized as belonging to any cluster,
and need to be retained in the buffer; b) the discard set (DS): the set of data points
which can be discarded after updating the sufficient statistics; c) the compression
set (CS): the set of data points which form smaller clusters among themselves, far
from the principal ones, and can be represented with other sufficient statistics. Each
data point is assigned to one of these sets on the basis of its distance from the center
of each cluster.

The main weakness of the BFR algorithm lies in the assumption that the covariance
matrix of each cluster is diagonal, which means that the components of the analyzed
multivariate data should be uncorrelated. Under this assumption, at each step of the
algorithm only the means and variances of each component of the cluster centers
must be retained. In the following we describe an extension of the BFR algorithm to
the case of clusters having a “full” covariance matrix. Since our method must also
retain the covariance terms of the clusters, there is an increase in the computational
cost, but this increase can be easily controlled and is affordable if the processed data
are not extremely high dimensional.


2 An extension of the BFR clustering algorithm 


We will use the same three sets of data a)-c) introduced in the BFR algorithm, but 
using different summary statistics to define the discard set and the compression set. 
2.1 Data Compression


As in the BFR algorithm, primary data compression determines the items to be discarded
(discard set DS), and updates the compression set CS with the sufficient summary
statistics of the identified clusters. Secondary data compression takes place
over the data points not compressed in the primary phase. Data compression refers to
representing groups of points by their sufficient statistics and purging these points from
RAM. In the following we will always represent vectors as column vectors. Assume
that data points $x_1,\dots,x_n \in \mathbb{R}^p$ must be compressed in the same cluster. We retain
only the sample mean $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the unbiased sample covariance matrix
$S_n = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x}_n)(x_i-\bar{x}_n)^T$. These two sufficient statistics can be easily computed
by keeping in memory the following quantities:


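\[
n, \qquad \mathrm{sumprod}_{kl}(n) = \sum_{i=1}^{n} x_{ik}x_{il}, \qquad \mathrm{sumprodcross}_{kl}(n) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_{ik}x_{jl},
\]
\[
\mathrm{sumsq}_{k}(n) = \sum_{i=1}^{n} x_{ik}^2, \qquad \mathrm{sum}_{k}(n) = \sum_{i=1}^{n} x_{ik}, \qquad k,l=1,\dots,p,\ k<l.
\]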


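These sufficient statistics can be easily updated when a new data point $x_{n+1}$ must
be added to the cluster, without processing again the already compressed points. In
fact, for $k,l = 1,\dots,p$, $k<l$, we have

\[
\begin{aligned}
\mathrm{sumprod}_{kl}(n+1) &= \sum_{i=1}^{n+1} x_{ik}x_{il} = \mathrm{sumprod}_{kl}(n) + x_{(n+1)k}\,x_{(n+1)l},\\
\mathrm{sumprodcross}_{kl}(n+1) &= \sum_{i=1}^{n+1}\sum_{j=1}^{n+1} x_{ik}x_{jl} = \mathrm{sumprodcross}_{kl}(n) + x_{(n+1)k}\,\mathrm{sum}_{l}(n)\\
&\quad + x_{(n+1)l}\,\mathrm{sum}_{k}(n) + x_{(n+1)k}\,x_{(n+1)l},\\
\mathrm{sumsq}_{k}(n+1) &= \sum_{i=1}^{n+1} x_{ik}^2 = \mathrm{sumsq}_{k}(n) + x_{(n+1)k}^2,\\
\mathrm{sum}_{k}(n+1) &= \sum_{i=1}^{n+1} x_{ik} = \mathrm{sum}_{k}(n) + x_{(n+1)k}.
\end{aligned}
\]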


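Thus at each step of the algorithm we have to retain in memory only $p^2 + p + 1$
sufficient statistics for each cluster, where p is the dimension of the data points. In
addition, note that merging two clusters only requires summing the corresponding
statistics; the exception is sumprodcross, which equals $\mathrm{sum}_k(n)\,\mathrm{sum}_l(n)$ and
therefore also needs the cross terms between the sums of the two clusters.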
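
The update and merge rules above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `ClusterStats` is hypothetical, and for brevity it stores the count, the vector of sums and the full matrix of cross-products $\sum_i x_i x_i^T$ (which contains both the sumsq and the sumprod terms); sumprodcross is not stored because it can be recovered as the outer product of the sums.

```python
import numpy as np

class ClusterStats:
    """Updatable summary statistics of one cluster (illustrative sketch)."""

    def __init__(self, p):
        self.n = 0                      # number of compressed points
        self.s = np.zeros(p)            # sum_k(n), k = 1..p
        self.ss = np.zeros((p, p))      # sumsq_k on the diagonal, sumprod_kl elsewhere

    def add(self, x):
        """Incorporate one new point without revisiting earlier ones."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.s += x
        self.ss += np.outer(x, x)

    def merge(self, other):
        """Statistics of the union of two clusters: plain sums suffice here."""
        out = ClusterStats(self.s.size)
        out.n = self.n + other.n
        out.s = self.s + other.s
        out.ss = self.ss + other.ss
        return out

    def mean(self):
        return self.s / self.n

    def cov(self):
        """Unbiased sample covariance recovered from the stored sums."""
        m = self.mean()
        return (self.ss - self.n * np.outer(m, m)) / (self.n - 1)
```

Recovering the covariance as a difference of sums can lose precision when the means are large relative to the spread, so a production version might prefer a centered update; that choice is outside what the text specifies.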


2.2 The covariance matrices of the clusters 


Note that when a new cluster is formed, it contains too few data points to obtain
a positive definite estimate of the covariance matrix from the sample covariance
matrix, at least as long as $n \le p$. This is a problem, since we need to invert this matrix to
compute the Mahalanobis distance that we use to assign the observations to the
clusters. Recent methods for estimating covariance matrices include banding,
tapering, penalization and shrinkage. We focus on the Steinian shrinkage
method since, as underlined in [8], it leads to covariance matrix estimators that
are non-singular, well-conditioned, expressed in closed form and computationally
cheap regardless of p. We use the diagonal matrix $D_S$ of the sample covariance
matrix S as the “target matrix” of the shrinkage method, noting that $D_S$ was the BFR
estimate of the covariance of each cluster used in [1]. In other words, in the presence
of few data our method coincides with that of [1], and we allow a progressive
influence of correlation as the number of data increases. Summing up, we use a linear
shrinkage estimator for the covariance matrix, like those proposed in [3, 4, 6, 8], of
the form $\hat{S} = (1-\lambda)S + \lambda D_S$, where S is the sample covariance matrix, $D_S$ is its
diagonal matrix, and $\lambda$ is a parameter in [0, 1] whose optimal value depends on the
number n of data in the cluster. The parameter $\lambda$ is initially set to 1, and its
value then decreases to 0 as $n \to \infty$. The theoretical optimal value $\lambda^*$ of $\lambda$ is found
by minimizing the risk function relative to the quadratic loss $E[\|\hat{S} - \Sigma\|^2]$ (see, e.g.,
[8, 6]), and it is a ratio depending on the unknown $\Sigma$. When the data are Gaussian, the
procedure proposed in [3] may be directly implemented to obtain unbiased estimators
of the numerator and denominator in the formula of $\lambda^*$. In the non-Gaussian setting,
a bias due to the fourth moment is present in the numerator, and it is corrected [6]
with the use of further statistics, such as the Q-statistics introduced in [4] (see also [2]).
Unfortunately, the Q-statistics cannot be computed from updatable sufficient
statistics, as our framework requires. To correct the bias, a new iterative
procedure based on three updatable statistics for each cluster has been successfully
developed.
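
To illustrate why shrinking towards the diagonal makes the estimate invertible even with very few points, here is a minimal sketch of the linear shrinkage estimator $(1-\lambda)S + \lambda D_S$. The function name is hypothetical, and $\lambda$ is passed in as a plain argument: the data-driven choice of $\lambda^*$ described in the text is not reproduced here.

```python
import numpy as np

def shrink_to_diagonal(S, lam):
    """Linear shrinkage of a sample covariance S towards its own diagonal.

    lam = 1 gives the diagonal (BFR-style) estimate, lam = 0 gives S itself.
    The value of lam is supplied by the caller (assumption of this sketch).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    D = np.diag(np.diag(S))             # diagonal target matrix D_S
    return (1.0 - lam) * S + lam * D

# Example: 3 points in dimension 5 give a singular sample covariance,
# while the shrunk estimate is positive definite for any lam > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
S = np.cov(X, rowvar=False)
S_hat = shrink_to_diagonal(S, 0.5)
print(np.linalg.matrix_rank(S), np.linalg.eigvalsh(S_hat).min() > 0)
```

Positive definiteness follows because $(1-\lambda)S$ is positive semidefinite and $\lambda D_S$ is positive definite whenever all sample variances are strictly positive.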


2.3 Model update 


As in the BFR algorithm, the second step of our algorithm consists of performing
K-means iterations over the sufficient statistics of compressed, discarded and retained
points. To assign a point to a cluster we use the Mahalanobis distance from
its center (sample mean): we assign a new data point x to cluster h, with center
$\bar{x}_h$ and estimated covariance matrix $\hat{S}_h$, if h is the index which minimizes
$\Delta(x,\bar{x}_h) = (x-\bar{x}_h)^T \hat{S}_h^{-1}(x-\bar{x}_h)$, and if $\Delta(x,\bar{x}_h)$ is smaller than a fixed threshold $\delta$. We also
compare x with each point $x_o$ in the retained set (RS), by computing
$\Delta(x,x_o) = (x-x_o)^T \hat{S}_P^{-1}(x-x_o)$, where $\hat{S}_P$ is the pooled covariance matrix based on
all $\hat{S}_h$:

\[
\hat{S}_P = \frac{(n_{h_1}-1)\hat{S}_{h_1} + (n_{h_2}-1)\hat{S}_{h_2} + \cdots + (n_{h_m}-1)\hat{S}_{h_m}}{n_{h_1} + n_{h_2} + \cdots + n_{h_m} - m} \tag{1}
\]

and where $n_h$ is the number of points in cluster h. With $\hat{S}_P$, we emphasize the
weighted importance of the directions that are more significant for the clusters when we
compute the distance between two “isolated” points. We then approximate locally
the distribution of the clusters with a p-variate Gaussian and we build confidence
regions around the centers of the clusters (see [5]). We then move $\bar{x}_h$ to the farthest
position from x in its confidence region, while we move the centers of the other
clusters to the closest positions with respect to x, and we check whether the cluster center
closest to x is still $\bar{x}_h$. If yes, we assign x to cluster h, we update the corresponding
sufficient statistics and we put x in the discard set; if the point is closer to a point $x_o$
of the retained set than to any cluster, we form a new secondary cluster (CS)
with the two points and we put x and $x_o$ in the discard set; otherwise, we put x in
the retained set (RS).
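
The core of the assignment rule can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the threshold value are hypothetical, the covariance matrices are assumed to be already invertible (for instance after shrinkage), and the confidence-region check on the perturbed centers is deliberately omitted.

```python
import numpy as np

def mahalanobis_sq(x, center, cov):
    """Squared Mahalanobis distance (x - c)^T cov^{-1} (x - c)."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(d @ np.linalg.solve(cov, d))

def pooled_covariance(covs, sizes):
    """Pooled covariance: sum_h (n_h - 1) S_h / (sum_h n_h - m)."""
    num = sum((n - 1) * S for S, n in zip(covs, sizes))
    return num / (sum(sizes) - len(sizes))

def assign(x, centers, covs, threshold):
    """Index of the closest cluster, or None if it is not close enough."""
    d2 = [mahalanobis_sq(x, c, S) for c, S in zip(centers, covs)]
    h = int(np.argmin(d2))
    return h if d2[h] < threshold else None

centers = [np.zeros(2), np.array([10.0, 10.0])]
covs = [np.eye(2), np.eye(2)]
print(assign([0.5, 0.0], centers, covs, threshold=9.0))   # close to cluster 0
print(assign([5.0, 5.0], centers, covs, threshold=9.0))   # too far from both
```

A point that returns `None` would then be compared with the retained set using the pooled covariance, as described above.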


2.4 Secondary data compression 


The purpose of secondary data compression is to identify “tight” sub-clusters of
points among the data that we cannot discard in the primary phase. In [1], this is
done in two phases. In the first, a K-means algorithm tries to locate subclusters,
which are then merged if they satisfy a “dense” condition. The candidate clusters for
merging are chosen sequentially, based on a hierarchical agglomerative clustering built on
the subclusters. Throughout this procedure, the Euclidean metric was adopted. Finally, the
number of clusters is initialized to K, and it can increase or decrease during the
procedure. We adopt the same general idea, but we modify the procedure. First, we
change the metric, by taking the pooled covariance $\hat{S}_P$ given in (1). As for isolated
points, we think that this metric is more precise than the Euclidean one for this stage.
Then, a hierarchical clustering is performed using Ward’s method: the distance
between two clusters $h_1$ and $h_2$ with $n_{h_1}, n_{h_2}$ points and centroids $\bar{x}_{h_1}$ and $\bar{x}_{h_2}$ is
given by

\[
\Delta(h_1,h_2) = \frac{n_{h_1}\,n_{h_2}}{n_{h_1}+n_{h_2}}\,(\bar{x}_{h_1}-\bar{x}_{h_2})^T\,\hat{S}_P^{-1}\,(\bar{x}_{h_1}-\bar{x}_{h_2}).
\]

Note that we sequentially merge two clusters only if a suitable dense condition is
fulfilled. For example, the total variance (i.e., the trace of the sample covariance
matrix) of the union of the two is required to be smaller than a suitable proportion
of the sum of the total variances of the single groups.
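
The Ward-type distance and the example dense condition can both be evaluated from the summary statistics alone. The sketch below is an assumption-laden illustration: the function names and the default `proportion` are hypothetical, and the total variance of the union is obtained from the standard decomposition of the pooled scatter into within-cluster and between-cluster parts.

```python
import numpy as np

def ward_distance(n1, m1, n2, m2, pooled_cov):
    """n1*n2/(n1+n2) * (m1-m2)^T pooled_cov^{-1} (m1-m2)."""
    d = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    return n1 * n2 / (n1 + n2) * float(d @ np.linalg.solve(pooled_cov, d))

def union_total_variance(n1, m1, S1, n2, m2, S2):
    """Trace of the sample covariance of the union of two clusters."""
    d = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    scatter = ((n1 - 1) * np.trace(S1) + (n2 - 1) * np.trace(S2)
               + n1 * n2 / (n1 + n2) * float(d @ d))
    return scatter / (n1 + n2 - 1)

def is_dense(n1, m1, S1, n2, m2, S2, proportion=1.0):
    """Example dense condition; `proportion` is a hypothetical tuning value."""
    merged = union_total_variance(n1, m1, S1, n2, m2, S2)
    return merged < proportion * (np.trace(S1) + np.trace(S2))
```

Because only counts, means and covariances enter these formulas, the merge decision never needs the discarded data points.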


3 Results on simulated data 


Synthetic data were created for the cases of 5 and 20 clusters. Data were sampled
from 5 or 20 independent p-variate Gaussians, with elements of their mean vectors
(the true means) uniformly distributed on [−5, 5]. The covariance matrices were
generated by computing products of the type $\Sigma = UHU^T$, where H is a diagonal
matrix with elements on the diagonal uniformly distributed on [0.7, 1.5], and U is the
orthonormal matrix obtained by the singular value decomposition of a symmetric
matrix $MM^T$, where the elements of the p × p matrix M are uniformly distributed
on [−2, 2]. In both cases, 5 and 20 clusters, we generated 10,000 vectors for each
cluster, with dimensions p = 10, 20, 50. This procedure guarantees that these clusters
are fairly well-separated Gaussians, an ideal situation for K-means. We applied
our procedure to these synthetic data, and we computed the secondary data compression
after each bucket of 50 or 100 data points. The results are reported in Table 1.
We note that the number of clusters is sometimes overestimated, in particular when
the dimension p of the data points is small, which corresponds to the case where
the clusters are less separated. In such cases, if the point clouds in different clusters
are gathered in particularly “elongated” and rather close ellipsoids, then the correct
detection of the clusters may be more difficult. We also note that when the number of
clusters is overestimated, many of the clusters are composed of 2 or 3 data points,
which can then be regarded as small groups of outliers. The method seems to be
almost insensitive to the bucket size. We conclude that the method proposed here
provides rather good results on synthetic data, even if some improvement could be
considered for the secondary data compression. The method is also being tested on
real data. An accurate comparison with the BFR algorithm will also be performed.


n. of true   dimension p      n. of data       n. of estimated   n. of small   n. of retained
clusters     of data points   in each bucket   clusters          clusters      points (outliers)
5            10               50               7                 1             0
5            20               50               5                 0             1
5            50               50               5                 0             0
5            10               100              8                 1             0
5            20               100              5                 0             1
5            50               100              5                 0             0
20           10               50               29                6             8
20           10               100              29                6             8
Table 1 Results of the application of the proposed algorithm to synthetic data. By small clusters
we mean clusters containing fewer than 4 data points
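
The simulation design described in Section 3 can be sketched as follows. This is a hedged reconstruction from the verbal description only: the function name, the seed and the symmetrisation step are assumptions of this sketch and not part of the original experiments.

```python
import numpy as np

def make_cluster(p, n, rng):
    """Draw n points from one p-variate Gaussian cluster.

    Mean entries ~ U[-5, 5]; covariance = U H U^T with H diagonal,
    entries ~ U[0.7, 1.5], and U taken from the SVD of M M^T,
    where M has entries ~ U[-2, 2].
    """
    mean = rng.uniform(-5.0, 5.0, size=p)
    M = rng.uniform(-2.0, 2.0, size=(p, p))
    U, _, _ = np.linalg.svd(M @ M.T)        # orthonormal matrix
    H = np.diag(rng.uniform(0.7, 1.5, size=p))
    cov = U @ H @ U.T
    cov = (cov + cov.T) / 2.0               # remove round-off asymmetry
    X = rng.multivariate_normal(mean, cov, size=n)
    return X, mean, cov

rng = np.random.default_rng(2017)
clusters = [make_cluster(p=10, n=1000, rng=rng) for _ in range(5)]
data = np.vstack([X for X, _, _ in clusters])
print(data.shape)
```

By construction the eigenvalues of each covariance matrix are the diagonal entries of H, so every cluster has variances between 0.7 and 1.5 along its principal directions.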
A Bayesian semiparametric model for terrorist
networks 


Un modello Bayesiano semiparametrico per reti 
terroristiche 


Emanuele Aliverti 


Abstract A recent field of research applies network-analysis tools to dark
networks, in which pairwise information about terrorists’ activities is
available. In this work we focus on the “Noordin Mohamed Top” dataset, developing
an asymmetric approach that treats one network as the response and the remaining
ones as covariates. The objective is to identify which information may be useful in
predicting terrorists’ collaboration in a bombing attack, while identifying
the most influential subjects involved in these dynamics. This aim is addressed
through an asymmetric Bayesian semiparametric model for networks that, through
a suitable prior specification, integrates flexible regularization and the detection of
leading nodes. Taking advantage of the Pólya-Gamma data augmentation scheme,
we develop an efficient Gibbs sampler to make inference on the parameters involved.
Abstract Un recente ambito di ricerca impiega strumenti tipici dell’analisi di reti
nei contesti di dark networks, nei quali sono disponibili informazioni riguardanti
attività terroristiche sotto forma di legami a coppie. In questo lavoro ci concentriamo
sul dataset relativo a “Noordin Mohamed Top”, sviluppando un approccio
asimmetrico che considera una particolare rete come risposta, e le rimanenti
come esplicative. L’obiettivo è identificare quale informazione possa essere utile per
predire la collaborazione di diversi terroristi in un attentato, identificando
contemporaneamente i più influenti soggetti coinvolti in queste dinamiche. Il problema è
affrontato tramite un modello Bayesiano semiparametrico per reti che, attraverso
un’opportuna specificazione delle distribuzioni a priori, incorpora al suo interno una
regolazione flessibile e l’identificazione dei nodi leader. Sfruttando lo schema
Pólya-Gamma per dati aumentati, presentiamo un efficiente Gibbs sampler per fare
inferenza sui parametri coinvolti.


Key words: Terrorism, networks, Bayesian semiparametrics, latent space,
spike-and-slab prior, matrix factorization
1 Introduction


After September 11th, intelligence agencies of different countries employed tools
of network science in the fight against terrorist groups, often named
dark networks. Great effort has been made to develop tools for identifying key players,
that is, actors within the network reporting high values of some suitable
network statistics. Aggressive strategies encountered several failures, and the
necessity of more sophisticated approaches became evident: [7], for example, propose
to focus on approaches less aggressive than direct military operations, involving a
subtle application of information technology tools in order to gather information from
various sources. The proper interpretation of the retrieved data may provide a deeper
description of terrorism, embracing at the same time social, economic and personal
aspects, and is thus useful for developing strategies to defeat criminal associations at their roots.


Our motivating application arises from the “Noordin Mohamed Top” dataset, drawn
from a publication of the International Crisis Group; it consists of different ties
among terrorists of the most ruthless group of Southeast Asia.
The data are coded into 10 symmetric relationships between the network’s leader,
Noordin Mohamed Top, and 78 affiliates, and are thus naturally represented as a multilayer
simple graph, that is, a structure $G = \{V, E_k\}$ where nodes (elements of V) represent
terrorists and edges (unordered pairs in the set $E_k$) represent the presence of the
k-th relationship between two generic subjects.

We expect a certain degree of association among different relationships, since
they are defined over the same set of nodes. Therefore, we would like to propose
an approach able to efficiently use the information held inside the “simpler” networks
in order to predict and make inference on the most interesting one, which is the
network of co-participation in the same terrorist bombing.


2 Proposed approach 


Our research objectives can be addressed by setting up an asymmetric framework that
treats one network as the response and the remaining ones as covariates. The proposal of [3]
is the most appropriate, and hence we adapt this approach to our purposes by including
nodal random effects and a non-parametric matrix factorization that avoids
the estimation of different models. Let v be the number of nodes of each network, Y
the v × v adjacency matrix of the response network and X the v × v × p array
containing the p adjacency matrices of the p explanatory networks. We
consider only undirected and unweighted networks (simple graphs), so the associated
adjacency matrices are all dichotomous, symmetric and with undefined elements
on the main diagonal. Hence $y_{ij} = y_{ji} \in \{0,1\}$ and $x_{ijk} = x_{jik} \in \{0,1\}$ for all i, j, k.
Since each entry of the response network can assume only two values (presence or absence of
an edge), it is reasonable to assume a conditional Bernoulli distribution for the underlying
generative mechanism. We parametrize $\pi_{ij}$, the probability of observing an
edge between node i and node j, through its log-odds $\theta_{ij}$. Formally:


\[
y_{ij} \mid \pi_{ij} \sim \mathrm{Bin}(1,\pi_{ij}), \qquad \theta_{ij} = \log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right).
\]

Furthermore, we decompose the linear predictor 6;; into two components: the 
first can be regarded as a parametric mixed model component, while the second as 
a non-parametric matrix factorization. 


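\[
\theta_{ij} = \underbrace{\alpha + \sum_{k=1}^{p}\,[\beta_k + b_{ik} + b_{jk}]\,x_{ijk}}_{\text{Parametric component: fixed and random effects}} + \underbrace{\sum_{h=1}^{H}\lambda_h z_{ih} z_{jh}}_{\text{Non-parametric component}} \tag{1}
\]

\[
\alpha\in\mathbb{R}, \quad \beta_k\in\mathbb{R},\ k=1,\dots,p, \quad z_{ih}\in\mathbb{R},\ \lambda_h\in\mathbb{R}^+,\ i=1,\dots,v,
\]
\[
b_i = (b_{i1},\dots,b_{ip}) \sim G, \quad i=1,\dots,v.
\]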

The parametric component describes the relationship between networks and detects
potentially influential nodes, which in this application means subjects whose
role in some relationships has been particularly different from the average one. The
basic interpretation is the following: $\alpha$ provides an indication of the density of the
response network, as an ordinary intercept in binomial regression. The coefficients
$\beta_k$ are fixed effects in a logistic regression, that is, the mean variation in the log-odds
of the outcome attributable to the k-th explanatory network. In order to take
advantage of the explanatory power of the covariate networks, we introduce additive
random effects for the generic nodes i and j involved in the (i, j)-th dyad.
In (1), $b_{ik}$ represents the specific deviation of the i-th node from the main effect $\beta_k$,
and so can account for its particular propensity for building ties in the response network.
For each relationship, the purpose is to identify subjects with whom it is more (or less)
likely to commit an attack, thus providing a clearer description of those dynamics.
Furthermore, additive random effects can account for the between-rows heterogeneity
contained in the explanatory networks, thus allowing a better estimation of the fixed
counterpart.

The non-parametric component decomposes the residual between the response and the
explanatory networks in a flexible way, that is, through a matrix factorization that
allows the number of factors to vary adaptively. It can be interpreted as a latent space
whose size is at most equal to H, in which $z_{ih}$ represents the h-th latent coordinate
of the i-th node, while $\lambda_h$ defines the importance of the h-th dimension of the
latent space in defining the final model. This strategy aims to adaptively account
for the dependencies in the response not captured by the explanatory networks, providing
estimates for the parametric component free of potentially confounding factors.
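
The structure of the linear predictor in (1) may be easier to see in code. The following is a minimal sketch that only evaluates $\theta_{ij}$ and $\pi_{ij}$ for given parameter values; the function name and the array layout are assumptions of this sketch, and no estimation is performed.

```python
import numpy as np

def edge_probabilities(alpha, beta, b, X, Z, lam):
    """Evaluate theta_ij and pi_ij of model (1) for all dyads.

    alpha : float, intercept
    beta  : (p,)      fixed effects
    b     : (v, p)    nodal random effects b_ik
    X     : (v, v, p) explanatory adjacency matrices x_ijk
    Z     : (v, H)    latent coordinates z_ih
    lam   : (H,)      positive loadings lambda_h
    """
    # coefficient beta_k + b_ik + b_jk for every dyad (i, j) and network k
    coef = beta[None, None, :] + b[:, None, :] + b[None, :, :]
    parametric = (coef * X).sum(axis=2)
    latent = (Z * lam) @ Z.T            # sum_h lambda_h z_ih z_jh
    theta = alpha + parametric + latent
    pi = 1.0 / (1.0 + np.exp(-theta))   # inverse of the log-odds transform
    return theta, pi
```

Since the diagonal of the adjacency matrices is undefined in the text, the diagonal entries returned here carry no meaning and should be ignored.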
3 Prior distribution and posterior simulation


For a complete Bayesian definition of the proposed model we need to specify proper 
prior distributions for the set of parameters involved. 


3.1 Parametric component 


We specify zero-mean normal distributions over the fixed effects parameters. Formally,

\[
\pi(\alpha) \sim N(\mu_{\alpha},\sigma_{\alpha}^2), \qquad \pi(\beta) = (\beta_1,\dots,\beta_p) \sim N_p(\mu_{\beta},\Sigma_{\beta}). \tag{2}
\]


In our application, we expect a certain level of heterogeneity in the nodes’ behavior,
both between different subjects and within the same subject when involved in different
relationships. For example, it is reasonable that dealing directly with leaders may lead to
a higher propensity to participate in the same terrorist attack. However, certain
subjects may have had a central role only in some specific relationships, such
as the school recruitment network, and a marginal position elsewhere; for that, we
need a prior distribution able to differentiate particular subjects from standard ones,
and hence we specify a spike-and-slab prior distribution [4] independently for each
p-dimensional vector referred to the generic i-th subject, i = 1,...,v. Formally:


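\[
G \sim N(0,\Gamma_i), \qquad \Gamma_i = \mathrm{diag}(\gamma_{i1},\dots,\gamma_{ip}), \qquad \gamma_{ik} = \phi_{ik}\tau_{ik}^2, \quad k=1,\dots,p,
\]
\[
\pi(\phi_{ik}) \propto (1-w_i)\,\delta_{v_0}(\cdot) + w_i\,\delta_1(\cdot), \tag{3}
\]
\[
\pi(\tau_{ik}^{-2}) \sim \mathrm{Gamma}(d_1,d_2), \qquad \pi(w_i) \sim \mathrm{Uniform}[0,1].
\]

In (3), $v_0$ is a value close to zero, and the hyper-parameters $d_1, d_2$ are chosen in order
to obtain, for $\gamma_{ik} = \phi_{ik}\tau_{ik}^2$, a continuous distribution characterized by a spike at $v_0$ and
a continuous right tail; $\delta_{v_0}(\cdot)$ and $\delta_1(\cdot)$ are point masses at $v_0$ and 1, respectively.

A Multiplicative Inverse Gamma (MIG) prior is specified as the prior probability measure
over the loading elements $\lambda_h$ in (1), and a standard Gaussian distribution is specified for
the latent coordinates. See [2] for a recent discussion regarding the properties of the
MIG prior. Formally:

\[
z_{ih} \overset{iid}{\sim} N(0,1), \quad i=1,\dots,v,\ h=1,\dots,H,
\]
\[
\lambda_h = \prod_{m=1}^{h}\frac{1}{\vartheta_m}, \qquad \vartheta_1 \sim \mathrm{Gamma}(a_1,1), \qquad \vartheta_{m\ge 2} \overset{iid}{\sim} \mathrm{Gamma}(a_2,1),
\]

with $a_1 > 0$ and $a_2 > 1$ fixed hyperparameters.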
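
A draw from the MIG prior on the loadings can be sketched as follows, assuming the multiplicative construction $\lambda_h = \prod_{m \le h} 1/\vartheta_m$ with Gamma-distributed factors of unit rate. The function name and the hyperparameter values in the example are illustrative only.

```python
import numpy as np

def sample_mig_loadings(H, a1, a2, rng):
    """One prior draw of (lambda_1, ..., lambda_H) under the MIG construction.

    theta_1 ~ Gamma(a1, 1), theta_m ~ Gamma(a2, 1) for m >= 2 (shape, rate 1),
    lambda_h = prod_{m <= h} 1 / theta_m.
    """
    theta = np.empty(H)
    theta[0] = rng.gamma(shape=a1, scale=1.0)
    theta[1:] = rng.gamma(shape=a2, scale=1.0, size=H - 1)
    return 1.0 / np.cumprod(theta)

rng = np.random.default_rng(0)
lam = sample_mig_loadings(H=5, a1=2.0, a2=3.0, rng=rng)
print(lam)
```

With $a_2 > 1$ the Gamma factors for $m \ge 2$ have mean above one, which is what encourages the loadings to shrink as h grows; the precise conditions are discussed in [2].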
3.2 Posterior Simulation


Adapting the Pólya-Gamma data augmentation strategy proposed by [6] in the logistic
regression framework, we can obtain the full-conditional distributions for the
parameters involved in our model, and hence implement a Gibbs sampling strategy.


4 Results 


The effects of the different networks are heterogeneous: for example, a tie in the
communication network increases, on average, the log-odds of collaborating in the same
bombing operation by a factor of around 2; furthermore, if two terrorists had been in
the same terrorist organization, the log-odds is lowered by a factor of around 1.33,
which is far from obvious. As for influential nodes, the spike-and-slab strategy
identifies several terrorists, confirmed to be such in the Indonesian reports. The predictive
performance recorded an average area under the ROC curve equal to 0.864, a
false positive rate equal to 0.225 and a total negative rate of 0.220, using as estimates
for the missing edges the mean of the posterior predictive density and, where
needed, the overall density of the response network as cutoff value.
Emerging challenges in official statistics: new
sources, methods and skills 


Giorgio Alleva 


Official statistics is challenged to provide an increasingly complete picture of the 
complexity of our societies with a compelling demand for data. At the same time, it 
is facing human resources and budget constraints. The development and dissemina- 
tion of new digital technologies have removed many obstacles, first of all the cost 
for production, storage and analysis of information. A leap forward in efficiency is 
therefore in order, if we are to meet our responsibilities and to guarantee ever
increasing quality standards. As sampling surveys are expensive, response rates are
decreasing and response burden must be reduced, data collection needs to be
optimised. The emergence of new data sources and availability of Big data and the
opportunity of a massive exploitation of those already at hand (like administrative 
data) require new tools and methodologies. The response to these thematic, method- 
ological and organizational challenges lies in “integration”: of sources, of methods, 
and of skills. Multiple use of data sources should be based on a re-engineering of 
the production process of official statistics. At Istat, the core of the new organisation 
aims at moving away from the ‘silo’ approach, typical of traditional statistical agen- 
cies, towards the enhancement of horizontal services: management, methodology 
and IT innovations drive the integration process, linking sources to boost coherence, 
tailoring new products to the different users’ needs, reducing the response burden 
through the reuse of available data and information, increasing the use of technol- 
ogy, and resulting in significant efficiency and time saving. This new organisational 
model supports the Integrated System of Statistical Registers, a single logical data 
asset resulting from the integration of survey data, administrative data as well as 
data coming from new sources. Pillars of this system are the Population Register, 
the Business Register and the Territorial Register which are interconnected with 
one another through the Activity Register. The Integrated System allows achieving 
units and variables identification and estimation consistency as single cohesive units, 
which will make several new analyses possible (including a longitudinal approach). 
Hence, the system will not only improve efficiency by means of economies of scale,
but also deliver higher-quality and richer statistical outputs. The path towards integration
cannot leave the research activity of the Institute aside: research and development 
of new techniques and methodologies are indeed at the center of Istat’s moderni- 
sation project. Istat has set up a three-year plan for methodological and thematic 
research. Methodological research will move along four strategic research areas: 
integrated system of registers; censuses obtained by data integration; big data; and 
the unique process. Innovation will emerge from the so-called “Innovation lab”, a 
place where researchers will share new ideas, test new solutions, new processes and 
new products. The frontier of data integration for official statistics is represented by 
the increasing opportunities to use big data to produce timely high-quality statis- 
tics with greater detail, and competing with growing numbers of new, non-official, 
players. NSIs are compelled to speed their production and make it more effective 
and less burdensome for respondents. In Istat, as in most NSIs in Europe, several 
projects using big data sources for the production of statistics are currently ongoing.
Some projects are in the early stages of implementation, while others are still
in the experimental phase. We expect to reap the first results in the near future. All 
those projects need to tackle three key issues: quality, privacy and security issues, 
partnerships. The use of big data in the production of official statistics is part of a 
wider strategy on “Experimental statistics” including new indicators from integra- 
tion of sources, new tools for new phenomena, and unconventional classifications. 
Outputs from this innovative work will need to be treated accordingly. Data
dissemination is another key issue in official statistics’ innovation, with the progressive
opening of our data at its core. Open data are a key enabler of data driven innova- 
tion. When official statistics meets open data, several benefits are generated: from 
the possibility to reach users more easily to the enrichment of the published in- 
formation with metadata that allow a proper interpretation. Much has already been 
done in the last years to disseminate them. Through the Linked Open Data portal 
users can now access interconnected and structured information through graphical 
interfaces that can be directly queried by external applications, independently of the 
technologies adopted. Finally, on the way towards innovation, high level skills and a 
change-driven culture are essential. Statistical institutions need data specialists able
to produce, integrate and interpret data and to work with big and open data, but such 
skills are also strongly requested by the market, everywhere in Europe. Italy needs 
to engage further to urgently foster these new professions. Statistical organisations 
can also benefit greatly from each other and from mutual exchange and support and 
networking. 


A fast algorithm for the canonical polyadic 
decomposition of large tensors 


Un algoritmo veloce per la decomposizione di grandi 
tensori 


R. André, X. Luciani and E. Moreau 


Abstract The canonical polyadic decomposition is one of the most widely used tensor
decompositions. However, classical decomposition algorithms such as alternating least
squares suffer from convergence problems, and thus the decomposition of large tensors
can be very time consuming. Recently it has been shown that the decomposition
can be rewritten as a joint eigenvalue decomposition problem. In this paper we
propose a fast joint eigenvalue decomposition algorithm, then we show how it can
benefit the canonical polyadic decomposition of large tensors.

Abstract The canonical decomposition of tensors is used in several fields, including
data science. However, classical decomposition algorithms such as alternating least
squares may exhibit convergence problems. For this reason, the decomposition of large
tensors can be very costly in terms of computation time. Recently, fast canonical
decomposition algorithms have been developed, based on the diagonalization of a set of
matrices in a common basis of eigenvectors. In this paper we propose an original
algorithm to solve this latter problem. We then highlight the most interesting aspect
of this approach for performing the canonical decomposition of large tensors.


Key words: Tensor, Canonical polyadic decomposition, PARAFAC, algorithms,
Joint eigenvalues decomposition


1 Introduction


In many data science applications, collected data have a multidimensional structure
and can thus be stored in multiway arrays (tensors). In this context, multiway
analysis provides efficient tools to analyze such data sets. In particular, the
Canonical Polyadic Decomposition (CPD), also known as PARAllel FACtor analysis
(PARAFAC), has been successfully applied in various domains such as chemometrics,
telecommunications, psychometrics and data mining, just to mention a few [7].

The CPD models the data by means of multilinear combinations, as described below.
Let us consider a data tensor $\mathcal{T}$ of order $Q$ (i.e. a $Q$-dimensional array) and size
$I_1 \times \cdots \times I_Q$. Its CPD of rank $N$ is then defined by:

$$\mathcal{T}_{i_1 \ldots i_Q} = \sum_{n=1}^{N} F^{(1)}_{i_1 n} F^{(2)}_{i_2 n} \cdots F^{(Q)}_{i_Q n} + \mathcal{E}_{i_1 \ldots i_Q}, \qquad (1)$$

where $\mathbf{F}^{(q)}$ is the $q$-th factor matrix, of size $I_q \times N$, and $\mathcal{E}$ is the error tensor.
One crucial point here is that this decomposition usually has a unique solution up to
trivial scaling and permutation indeterminacies. The idea is then that the meaningful
information lies in the factor matrices. Thus we want to estimate these matrices from
the data. Several algorithms have been proposed for this purpose. The most popular is
the Alternating Least Squares algorithm (ALS) [6]. This iterative algorithm is very
simple to implement and usually provides accurate results. However, it suffers from
well known convergence problems. In particular, the convergence is very sensitive to
the initialization and the algorithm can easily get stuck in a local minimum of the
cost function. A smart initialization is always possible, but in practice it is better
to perform several runs of the algorithm with random initializations. A second
consequence is that it is difficult to set the threshold of the stopping criterion
efficiently. Indeed, it frequently occurs that the algorithm escapes from a local
minimum only after a very large number of iterations, and during these iterations the
variations of the cost function can be very small. This issue becomes significant when
the computational cost per iteration is high, i.e. for high rank CPD of large tensors.
The decomposition performed with ALS can therefore have a high effective computational
cost and be time consuming in that setting. Several other iterative algorithms have
been proposed to solve those convergence problems, but in practice the computational
cost of these solutions remains high. More details about ALS convergence problems and
other iterative CPD algorithms can be found in [7], [3] and [1].
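
As an illustration of model (1) and of its scaling and permutation indeterminacies, the following minimal NumPy sketch builds a noiseless order-3 tensor from random factor matrices. The function name `cpd_reconstruct`, the sizes and the random seed are illustrative assumptions, not part of the paper.

```python
import numpy as np

def cpd_reconstruct(factors):
    """Noiseless part of equation (1): sum over n of the outer products
    of the n-th columns of all factor matrices."""
    rank = factors[0].shape[1]
    tensor = np.zeros([f.shape[0] for f in factors])
    for n in range(rank):
        component = factors[0][:, n]
        for f in factors[1:]:
            component = np.multiply.outer(component, f[:, n])
        tensor += component
    return tensor

rng = np.random.default_rng(0)
N = 2                                              # CPD rank (illustrative)
F = [rng.standard_normal((size, N)) for size in (4, 5, 6)]
T = cpd_reconstruct(F)

# Entry-wise check of equation (1) for one index triple.
i1, i2, i3 = 1, 3, 2
entry = sum(F[0][i1, n] * F[1][i2, n] * F[2][i3, n] for n in range(N))

# Permutation indeterminacy: reordering the columns of all factors
# in the same way leaves the tensor unchanged.
perm = [1, 0]
T_permuted = cpd_reconstruct([f[:, perm] for f in F])

# Scaling indeterminacy: scaling a column of one factor and applying
# the inverse scaling to the same column of another factor has no effect.
c = np.array([2.0, -0.5])
T_rescaled = cpd_reconstruct([F[0] * c, F[1] / c, F[2]])
```

Both `T_permuted` and `T_rescaled` coincide with `T`, which is why the factor matrices can only be recovered up to these trivial ambiguities.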

Recently several authors showed how to rewrite the CPD as a Joint EigenValue
Decomposition (JEVD) of a matrix set [5, 8, 10]. The JEVD consists in finding
the eigenvector matrix $\mathbf{A}$ that jointly diagonalizes a given set of $K$ non-defective
matrices $\mathbf{M}^{(k)}$ in the following way:

$$\mathbf{M}^{(k)} = \mathbf{A}\mathbf{D}^{(k)}\mathbf{A}^{-1}, \quad \forall k = 1,\ldots,K, \qquad (2)$$

where the $\mathbf{D}^{(k)}$ are diagonal matrices. This approach makes it possible to reduce the
computational cost of the CPD because JEVD algorithms converge in few iterations with
an excellent convergence rate. Furthermore, it is less sensitive than ALS to an
overestimation of the CPD rank.

Several JEVD algorithms have been proposed in the last decade [4, 8, 9]. In a
recent paper we introduced an algorithm called JDTE [2]. This algorithm offers
a good trade-off between speed and precision, but its performance degrades as the
matrix size grows. In the CPD context, this means that it is not suitable for high
rank CPD. As a consequence, we propose in this paper an improved version of this
algorithm.

The paper is organized as follows. In the next section we recall a simple and
economic way to rewrite the CPD as a JEVD problem. Then in section 3 we describe
the proposed JEVD algorithm. Finally, in section 4 we evaluate our approach for the
decomposition of large tensors by means of numerical simulations.

In the following, the operator Diag{·} represents the diagonal matrix built from
the diagonal of the matrix argument, the operator ZDiag{·} sets to zero the diagonal
of the matrix argument and $\|\cdot\|$ is the Frobenius norm of the argument matrix or
tensor.
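
The JEVD model (2) and the two operators just defined can be illustrated with a small NumPy sketch. The helper names `Diag`, `ZDiag` and `joint_off_diagonal_norm`, as well as the sizes and the seed, are assumptions made for this example only.

```python
import numpy as np

def Diag(X):
    """Diagonal matrix built from the diagonal of X."""
    return np.diag(np.diag(X))

def ZDiag(X):
    """Copy of X whose diagonal is set to zero."""
    return X - Diag(X)

def joint_off_diagonal_norm(B, matrices):
    """Sum over k of ||ZDiag{B^{-1} M^(k) B}||^2 (Frobenius norm)."""
    B_inv = np.linalg.inv(B)
    return sum(np.linalg.norm(ZDiag(B_inv @ Mk @ B)) ** 2 for Mk in matrices)

rng = np.random.default_rng(1)
N, K = 4, 6
A = rng.standard_normal((N, N))                      # common eigenvector matrix
D = [np.diag(rng.standard_normal(N)) for _ in range(K)]
M = [A @ Dk @ np.linalg.inv(A) for Dk in D]          # equation (2)

# A permuted and rescaled version of A diagonalizes the set equally well.
B = A[:, ::-1] * np.array([2.0, -1.0, 0.5, 3.0])

err_A = joint_off_diagonal_norm(A, M)          # numerically zero
err_B = joint_off_diagonal_norm(B, M)          # numerically zero
err_I = joint_off_diagonal_norm(np.eye(N), M)  # clearly non-zero
```

The fact that `err_B` vanishes just like `err_A` reflects the permutation and scaling indeterminacy of the JEVD problem.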


2 From CPD to JEVD 


There are several ways to rewrite the CPD as a JEVD problem. Here we use the
method described in [8] because the associated algorithm, called DIAG, has the
lowest numerical complexity.

We consider the tensor $\mathcal{T}$ and its CPD of rank $N$ defined in the introduction. The
first step consists in rearranging the entries of $\mathcal{T}$ into an unfolding matrix $\mathbf{T}$ of size
$\prod_{q=1}^{P} I_q \times \prod_{q=P+1}^{Q} I_q$ by merging the first $P$ modes on the rows of $\mathbf{T}$ and the $Q - P$ other
modes on its columns. Defining for all couples of integers $(a,b)$ with $a \le b$:

$$\mathbf{Y}^{(b,a)} = \mathbf{F}^{(b)} \odot \mathbf{F}^{(b-1)} \odot \cdots \odot \mathbf{F}^{(a)}, \qquad (3)$$

where $\odot$ is the Khatri-Rao product, we can thus rewrite (1) in matrix form:

$$\mathbf{T} = \mathbf{Y}^{(P,1)} \left(\mathbf{Y}^{(Q,P+1)}\right)^T. \qquad (4)$$

Of course, other mergings of the tensor modes could have been chosen, leading to
other unfolding matrices. The choice of the unfolding matrix can have a huge impact
on the numerical complexity of the DIAG algorithm [8]. As a rule of thumb, when
all tensor dimensions are large we recommend choosing $P = Q - 2$ and placing the
smallest dimension at the end ($I_Q \le I_q, \forall q$). In the following we assume that the rank
of $\mathbf{T}$ is not greater than $N$.

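The second step is the Singular Value Decomposition (SVD) of $\mathbf{T}$, truncated at
order $N$. We denote by $\mathbf{U}$, $\mathbf{S}$ and $\mathbf{V}$ the matrices of this truncated SVD.
At this stage, there exists a unique nonsingular square matrix $\mathbf{A}$ of size $N \times N$
such that:

$$\mathbf{Y}^{(P,1)} = \mathbf{U}\mathbf{A} \quad \text{and} \quad \left(\mathbf{Y}^{(Q,P+1)}\right)^T = \mathbf{A}^{-1}\mathbf{S}\mathbf{V}^T. \qquad (5)$$

$\left(\mathbf{Y}^{(Q,P+1)}\right)^T$ can be seen as a horizontal block matrix:

$$\left(\mathbf{Y}^{(Q,P+1)}\right)^T = \left[\boldsymbol{\Phi}^{(1)} \left(\mathbf{Y}^{(Q-1,P+1)}\right)^T, \ldots, \boldsymbol{\Phi}^{(I_Q)} \left(\mathbf{Y}^{(Q-1,P+1)}\right)^T\right], \qquad (6)$$

where $\boldsymbol{\Phi}^{(1)}, \ldots, \boldsymbol{\Phi}^{(I_Q)}$ are the $I_Q$ diagonal matrices built from the $I_Q$ rows of matrix $\mathbf{F}^{(Q)}$.
Then, (5) and (6) yield:

$$\mathbf{S}\mathbf{V}^T = \left[\boldsymbol{\Gamma}^{(1)T}, \ldots, \boldsymbol{\Gamma}^{(I_Q)T}\right], \qquad (7)$$

where $\boldsymbol{\Gamma}^{(k_1)} = \mathbf{Y}^{(Q-1,P+1)} \boldsymbol{\Phi}^{(k_1)} \mathbf{A}^T$ for $k_1 = 1, \ldots, I_Q$. Assuming that matrices $\boldsymbol{\Gamma}^{(k)}$ and
matrix $\mathbf{Y}^{(Q-1,P+1)}$ are full column rank, they all admit a Moore-Penrose matrix
inverse, denoted by $\sharp$. Thereby, we can define for any couple $(k_1,k_2)$ with $k_1 = 1, \ldots, I_Q - 1$
and $k_2 = k_1 + 1, \ldots, I_Q$:

$$\mathbf{M}^{(k_1,k_2)} \overset{\text{def}}{=} \left(\boldsymbol{\Gamma}^{(k_1)\sharp}\, \boldsymbol{\Gamma}^{(k_2)}\right)^T \qquad (8)$$
$$\phantom{\mathbf{M}^{(k_1,k_2)}} = \mathbf{A}\mathbf{D}^{(k_1,k_2)}\mathbf{A}^{-1}, \qquad (9)$$

where $\mathbf{D}^{(k_1,k_2)} = \boldsymbol{\Phi}^{(k_1)-1}\boldsymbol{\Phi}^{(k_2)}$ are diagonal matrices. As a result, $\mathbf{A}$ performs the JEVD
of the set of matrices $\mathbf{M}^{(k_1,k_2)}$ and can be estimated using a JEVD algorithm. An
important observation has to be made here. In the previous step we have built
$I_Q(I_Q - 1)/2$ matrices $\mathbf{M}^{(k_1,k_2)}$. When dealing with large tensors this number can be very
high with respect to the matrix size. In practice this does not significantly improve
the estimation of $\mathbf{A}$, and it dramatically increases the numerical complexity of
the JEVD step. Therefore, we propose as an alternative to build only a subset of $I_Q - 1$
matrices, for instance by taking $k_2 = k_1 + 1$ in (8). This can be seen as an economic
version of the DIAG algorithm.

After the JEVD, matrices $\mathbf{Y}^{(P,1)}$ and $\mathbf{Y}^{(Q,P+1)}$ are immediately deduced from $\mathbf{A}$
using (5). Finally, we can easily deduce $\mathbf{F}^{(1)}, \ldots, \mathbf{F}^{(Q)}$ from them as explained in [8].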

In the next section, we propose an algorithm to solve the JEVD step. In order to
simplify the notations, the pair of superscripts $(k_1,k_2)$ is replaced by a single
superscript $k$, so that equation (9) becomes:

$$\mathbf{M}^{(k)} = \mathbf{A}\mathbf{D}^{(k)}\mathbf{A}^{-1}, \quad \forall k = 1,\ldots,K, \qquad (10)$$

where $K = I_Q(I_Q - 1)/2$ or $K = I_Q - 1$ depending on whether we choose the original
DIAG algorithm or the economic version.
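
The steps of this section can be sketched for a noiseless tensor of order 3 ($Q = 3$, $P = 1$), using the economic subset $k_2 = k_1 + 1$. This is only an illustrative sketch under several assumptions: the function name `diag_eco_order3` is hypothetical, the JEVD step is replaced by a plain eigendecomposition of a single matrix of the set (a stand-in, not the algorithm of section 3), and the last step recovers the two remaining factors by a rank-one SVD of each reshaped column, which is a common choice and not necessarily the procedure of [8].

```python
import numpy as np

def diag_eco_order3(T, N):
    """CPD of an order-3 tensor through equations (4)-(9), economic subset."""
    I1, I2, I3 = T.shape
    # Unfolding with P = 1: mode 1 on the rows, modes (3, 2) on the columns,
    # so that T_mat = F1 (F3 kr F2)^T as in equation (4).
    T_mat = T.transpose(0, 2, 1).reshape(I1, I3 * I2)
    U, s, Vt = np.linalg.svd(T_mat, full_matrices=False)
    U = U[:, :N]
    SVt = s[:N, None] * Vt[:N]                 # S V^T of the truncated SVD
    # Equation (7): the k-th block of S V^T is the transpose of Gamma^(k).
    Gamma = [SVt[:, k * I2:(k + 1) * I2].T for k in range(I3)]
    # Equation (8) with k2 = k1 + 1: I3 - 1 matrices sharing the eigenvectors A.
    M = [(np.linalg.pinv(Gamma[k]) @ Gamma[k + 1]).T for k in range(I3 - 1)]
    # Stand-in for the JEVD step: eigenvectors of one matrix of the set.
    _, A = np.linalg.eig(M[0])
    A = np.real(A)
    F1 = U @ A                                 # equation (5), first part
    KR = np.linalg.solve(A, SVt).T             # Y^(3,2) = F3 kr F2, equation (5)
    F2 = np.empty((I2, N))
    F3 = np.empty((I3, N))
    for n in range(N):
        # Column n of KR, reshaped, is the rank-one matrix F3[:, n] F2[:, n]^T.
        u, sv, vt = np.linalg.svd(KR[:, n].reshape(I3, I2))
        F3[:, n] = u[:, 0] * sv[0]
        F2[:, n] = vt[0]
    return F1, F2, F3

rng = np.random.default_rng(0)
N = 3
true = [rng.standard_normal((size, N)) for size in (6, 5, 4)]
T = np.einsum('in,jn,kn->ijk', *true)
F1, F2, F3 = diag_eco_order3(T, N)
T_hat = np.einsum('in,jn,kn->ijk', F1, F2, F3)
rel_err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
```

With noisy data the matrices of the set no longer share exact eigenvectors, which is precisely why a dedicated joint algorithm is used instead of the single eigendecomposition of this sketch.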


3 A fast JEVD algorithm 


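We propose here a fast algorithm to compute an estimate of $\mathbf{A}$, denoted $\mathbf{B}$, up to a
permutation and scaling indeterminacy of the columns. This indeterminacy is inherent
to the JEVD problem.

We want $\mathbf{B}$ to jointly diagonalize the set of matrices $\mathbf{M}^{(k)}$. It means that the matrices
$\hat{\mathbf{D}}^{(k)}$ defined by:

$$\hat{\mathbf{D}}^{(k)} = \mathbf{B}^{-1}\mathbf{M}^{(k)}\mathbf{B}, \quad \forall k = 1,\ldots,K \qquad (11)$$

must be as diagonal as possible. $\mathbf{B}$ is called the diagonalizing matrix. This kind of
problem can be efficiently solved by an iterative procedure based on multiplicative
updates. Before the first iteration, we set $\hat{\mathbf{D}}^{(k)} = \mathbf{M}^{(k)}$, which corresponds to $\mathbf{B}$ equal to
the identity matrix. Then, at each iteration, matrices $\mathbf{B}$ and $\hat{\mathbf{D}}^{(k)}$ are updated by a new
matrix $\mathbf{X}$ as follows:

$$\mathbf{B} \leftarrow \mathbf{B}\mathbf{X} \quad \text{and} \quad \hat{\mathbf{D}}^{(k)} \leftarrow \mathbf{X}^{-1}\hat{\mathbf{D}}^{(k)}\mathbf{X}, \quad \forall k = 1,\ldots,K. \qquad (12)$$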

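The strategy that we now propose to compute the updating matrix $\mathbf{X}$ can be seen as
a modified version of the one we proposed in [2]. The main difference is that here
we resort to a sweeping procedure. It means that $\mathbf{X}$ is built from a set of $N(N - 1)/2$
matrices, denoted $\mathbf{X}^{(i,j)}$ ($i = 1,\ldots,N - 1$ and $j = i + 1,\ldots,N$), as follows:

$$\mathbf{X} = \prod_{i=1}^{N-1} \prod_{j=i+1}^{N} \mathbf{X}^{(i,j)}. \qquad (13)$$

As a consequence, at each iteration, the updates in (12) now consist of $N(N - 1)/2$
successive $(i,j)$-updates of $\mathbf{B}$ and $\hat{\mathbf{D}}^{(k)}$, defined as:

$$\mathbf{B} \leftarrow \mathbf{B}\mathbf{X}^{(i,j)} \quad \text{and} \quad \hat{\mathbf{D}}^{(k)} \leftarrow \left(\mathbf{X}^{(i,j)}\right)^{-1}\hat{\mathbf{D}}^{(k)}\mathbf{X}^{(i,j)}, \quad \forall k = 1,\ldots,K. \qquad (14)$$

Furthermore, because of the scaling indeterminacy of the JEVD problem we can
impose the following structure on matrices $\mathbf{X}^{(i,j)}$: $\mathbf{X}^{(i,j)}$ is equal to the identity matrix
with the exception of entries $X^{(i,j)}_{ij}$ and $X^{(i,j)}_{ji}$, which are equal to two unknown parameters:

$$X^{(i,j)}_{ij} = x_{ij}, \qquad (15)$$
$$X^{(i,j)}_{ji} = x_{ji}. \qquad (16)$$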

We now explain how these parameters are computed for a given couple $(i,j)$. First
of all, let us define the function $C$ as

$$C(\mathbf{X}^{(i,j)}) = \sum_{k=1}^{K} \left\| \mathrm{ZDiag}\left\{ \left(\mathbf{X}^{(i,j)}\right)^{-1} \hat{\mathbf{D}}^{(k)} \mathbf{X}^{(i,j)} \right\} \right\|^2. \qquad (17)$$

$C$ is a classical diagonalization criterion that is equal to zero if the $K$ updated
matrices are diagonal. Therefore we look for the $\mathbf{X}^{(i,j)}$ that minimizes $C$.

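Matrix $\mathbf{X}^{(i,j)}$ can be decomposed as $\mathbf{X}^{(i,j)} = \mathbf{I} + \mathbf{Z}^{(i,j)}$, where $\mathbf{I}$ is the identity matrix and
$\mathbf{Z}^{(i,j)} = \mathrm{ZDiag}\{\mathbf{X}^{(i,j)}\}$. The criterion can then be written as:

$$C(\mathbf{X}^{(i,j)}) = C(\mathbf{Z}^{(i,j)}) = \sum_{k=1}^{K} \left\| \mathrm{ZDiag}\left\{ \left(\mathbf{I} + \mathbf{Z}^{(i,j)}\right)^{-1} \hat{\mathbf{D}}^{(k)} \left(\mathbf{I} + \mathbf{Z}^{(i,j)}\right) \right\} \right\|^2. \qquad (18)$$

We consider in fact an approximation of $C(\mathbf{X}^{(i,j)})$, assuming that we are close to the
diagonalizing solution, i.e. that $\mathbf{X}^{(i,j)}$ is close to the identity matrix. This implies that
$\|\mathbf{Z}^{(i,j)}\| \ll 1$ and thus the first order Taylor expansion of $(\mathbf{I} + \mathbf{Z}^{(i,j)})^{-1}$ yields:

$$\left(\mathbf{I} + \mathbf{Z}^{(i,j)}\right)^{-1} \hat{\mathbf{D}}^{(k)} \left(\mathbf{I} + \mathbf{Z}^{(i,j)}\right) \simeq \left(\mathbf{I} - \mathbf{Z}^{(i,j)}\right) \hat{\mathbf{D}}^{(k)} \left(\mathbf{I} + \mathbf{Z}^{(i,j)}\right) \qquad (19)$$
$$\simeq \hat{\mathbf{D}}^{(k)} - \mathbf{Z}^{(i,j)}\hat{\mathbf{D}}^{(k)} + \hat{\mathbf{D}}^{(k)}\mathbf{Z}^{(i,j)} - \mathbf{Z}^{(i,j)}\hat{\mathbf{D}}^{(k)}\mathbf{Z}^{(i,j)} \qquad (20)$$
$$\simeq \hat{\mathbf{D}}^{(k)} - \mathbf{Z}^{(i,j)}\hat{\mathbf{D}}^{(k)} + \hat{\mathbf{D}}^{(k)}\mathbf{Z}^{(i,j)}. \qquad (21)$$

In the same way, matrices $\hat{\mathbf{D}}^{(k)}$ can be decomposed as $\hat{\mathbf{D}}^{(k)} = \boldsymbol{\Lambda}^{(k)} + \mathbf{O}^{(k)}$, where
$\boldsymbol{\Lambda}^{(k)} = \mathrm{Diag}\{\hat{\mathbf{D}}^{(k)}\}$ and $\mathbf{O}^{(k)} = \mathrm{ZDiag}\{\hat{\mathbf{D}}^{(k)}\}$. Here our assumption means that matrices $\hat{\mathbf{D}}^{(k)}$ are
almost diagonal and thus $\|\mathbf{O}^{(k)}\| \ll 1$. Neglecting the second order terms $\mathbf{Z}^{(i,j)}\mathbf{O}^{(k)}$ and
$\mathbf{O}^{(k)}\mathbf{Z}^{(i,j)}$, it yields:

$$\hat{\mathbf{D}}^{(k)} - \mathbf{Z}^{(i,j)}\hat{\mathbf{D}}^{(k)} + \hat{\mathbf{D}}^{(k)}\mathbf{Z}^{(i,j)} \simeq \boldsymbol{\Lambda}^{(k)} + \mathbf{O}^{(k)} - \mathbf{Z}^{(i,j)}\boldsymbol{\Lambda}^{(k)} + \boldsymbol{\Lambda}^{(k)}\mathbf{Z}^{(i,j)} \qquad (22)$$

and finally we can approximate $C(\mathbf{Z}^{(i,j)})$ by $C_a(\mathbf{Z}^{(i,j)})$:

$$C_a(\mathbf{Z}^{(i,j)}) = \sum_{k=1}^{K} \left\| \mathrm{ZDiag}\left\{ \mathbf{O}^{(k)} - \mathbf{Z}^{(i,j)}\boldsymbol{\Lambda}^{(k)} + \boldsymbol{\Lambda}^{(k)}\mathbf{Z}^{(i,j)} \right\} \right\|^2 \qquad (23)$$
$$= \sum_{k=1}^{K} \sum_{\substack{m,n=1 \\ m \neq n}}^{N} \left( O^{(k)}_{mn} + Z^{(i,j)}_{mn}\Lambda^{(k)}_{mm} - Z^{(i,j)}_{mn}\Lambda^{(k)}_{nn} \right)^2 \qquad (24)$$
$$= \sum_{k=1}^{K} \left[ \left( O^{(k)}_{ij} + x_{ij}\left(\Lambda^{(k)}_{ii} - \Lambda^{(k)}_{jj}\right) \right)^2 + \left( O^{(k)}_{ji} + x_{ji}\left(\Lambda^{(k)}_{jj} - \Lambda^{(k)}_{ii}\right) \right)^2 \right] + c, \qquad (25)$$

where $c$ does not depend on $x_{ij}$ and $x_{ji}$, since these are the only non-zero entries of $\mathbf{Z}^{(i,j)}$.
We can then easily show that $C_a$ is minimum for:

$$\left(x_{ij},\, x_{ji}\right) = \left( \frac{\sum_{k=1}^{K} O^{(k)}_{ij}\left(\Lambda^{(k)}_{jj} - \Lambda^{(k)}_{ii}\right)}{\sum_{k=1}^{K} \left(\Lambda^{(k)}_{ii} - \Lambda^{(k)}_{jj}\right)^2},\; \frac{\sum_{k=1}^{K} O^{(k)}_{ji}\left(\Lambda^{(k)}_{ii} - \Lambda^{(k)}_{jj}\right)}{\sum_{k=1}^{K} \left(\Lambda^{(k)}_{ii} - \Lambda^{(k)}_{jj}\right)^2} \right). \qquad (26)$$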

We call this algorithm SJDTE for Sweeping Joint eigenvalue Decomposition 
based on a Taylor Expansion. 
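
The sweeping procedure can be sketched in NumPy as follows. This is a minimal reading of updates (12) to (16) with parameters obtained by minimizing the first order approximation of the criterion, not the authors' Matlab implementation; the function names, the test matrices (built from a matrix close to the identity, in line with the small perturbation assumption) and the seed are assumptions of this example.

```python
import numpy as np

def off_diagonal_energy(D):
    """Criterion C: sum over k of ||ZDiag{D^(k)}||^2 for a stack of shape (K, N, N)."""
    N = D.shape[-1]
    return float(np.sum((D * (1.0 - np.eye(N))) ** 2))

def sjdte(M, max_sweeps=10, tol=1e-3):
    """Sweeping joint diagonalization of the stack M of shape (K, N, N)."""
    D = np.array(M, dtype=float)          # D^(k) initialised to M^(k)
    N = D.shape[-1]
    B = np.eye(N)
    previous = off_diagonal_energy(D)
    for _ in range(max_sweeps):
        for i in range(N - 1):
            for j in range(i + 1, N):
                diff = D[:, i, i] - D[:, j, j]        # Lambda_ii - Lambda_jj for all k
                denom = np.sum(diff ** 2)
                if denom == 0.0:
                    continue                          # no information for this pair
                X = np.eye(N)
                X[i, j] = -np.sum(D[:, i, j] * diff) / denom
                X[j, i] = np.sum(D[:, j, i] * diff) / denom
                B = B @ X                             # (i, j)-update of B
                D = np.linalg.inv(X) @ D @ X          # (i, j)-update of all D^(k)
        current = off_diagonal_energy(D)
        if abs(previous - current) <= tol * previous:
            break                                     # relative change of C below tol
        previous = current
    return B, D

rng = np.random.default_rng(2)
N, K = 4, 8
A = np.eye(N) + 0.05 * rng.standard_normal((N, N))
M = np.array([A @ np.diag(rng.standard_normal(N)) @ np.linalg.inv(A)
              for _ in range(K)])
B, D = sjdte(M)
initial_C = off_diagonal_energy(M)
final_C = off_diagonal_energy(D)
```

Each $(i,j)$-update only needs the diagonal entries and two off-diagonal entries of every matrix of the set, so the cost of computing the parameters is linear in $K$.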


4 Numerical Simulations 


We have included SJDTE in the DIAG procedure described in section 2 for the
JEVD step. Two versions of this CPD algorithm were implemented, corresponding
to the original and economic versions of DIAG. In the following, these are referred
to as DIAG-SJDTE and DIAG-SJDTE-eco respectively and are compared with
ALS for the CPD of large tensors of order 3. For this purpose we define the tensor
reconstruction error $r_T = \|\mathcal{T} - \hat{\mathcal{T}}\| / \|\mathcal{T}\|$ and the factor matrices estimation error
$r_F = \sum_{q=1}^{3} \|\mathbf{F}^{(q)} - \hat{\mathbf{F}}^{(q)}\| / \|\mathbf{F}^{(q)}\|$, where $\hat{\mathbf{F}}^{(1)}$, $\hat{\mathbf{F}}^{(2)}$ and $\hat{\mathbf{F}}^{(3)}$ are the factor matrices estimated
by an algorithm and $\hat{\mathcal{T}}$ is the tensor reconstructed from these matrices. Other
comparison criteria are the number of computed iterations, $n_{it}$, and the cputime of
Matlab (elapsed time during the algorithm run), $t_{cpu}$. Of course the cputime strongly
depends on the implementation of the algorithms, and for this reason the computational
cost might be preferred. However, in the present case, the numerical complexities
of the compared algorithms involve subroutines such as truncated SVDs whose numerical
complexity is hard to evaluate with precision. Furthermore, all the algorithms
compared in this section were carefully implemented in house in the Matlab language
and optimized for it.

We have implemented two versions of ALS. In the first version (ALS-1), the
ALS procedure is stopped when the relative difference between two successive values
of $r_T$ is lower than 10~* or when $n_{it}$ reaches 50. In the second version (ALS-2),
we set these two values to $10^{-8}$ and 200. SJDTE is stopped when the relative
difference between two successive values of $C$ is lower than $10^{-3}$ or when $n_{it}$ reaches 10.

Table 1 Average reconstruction error, estimation error, number of iterations and cputime.

Algorithm      | Scenario 1: N = 5              | Scenario 2: N = 15             | Scenario 3: N = 30
               | r_T   r_F     n_it  t_cpu (s)  | r_T    r_F    n_it  t_cpu (s)  | r_T    r_F     n_it  t_cpu (s)
ALS-1          | 0.2   0.35    7     3.41       | 0.22   0.43   10    7.02       | 0.2    0.42    13    17.5
ALS-2          | 0.16  0.29    10    14.5       | 0.17   0.34   16    52.8       | 0.16   0.34    21    167
DIAG-SJDTE     | 0.01  0.0006  4     5.78       | 0.012  0.005  5     38.9       | 0.014  0.0162  5     275
DIAG-SJDTE-eco | 0.01  0.0016  4     4.36       | 0.013  0.006  5     7.36       | 0.017  0.019   5     14

Fig. 1 Distributions of the estimation error for the four algorithms according to the value of N.

Comparisons are made by means of Monte-Carlo (MC) simulations. For each MC
run, three new factor matrices of size $200 \times N$ are randomly drawn from a normal
distribution and a new tensor is built from the CPD model. White Gaussian noise
is then added to its entries in order to obtain a signal to noise ratio of 40 dB. Then
the four algorithms are run to compute the CPD of rank $N$ of the noisy tensor. We
distinguish three scenarios according to the chosen value of $N$: $N = 5$ (scenario 1),
$N = 15$ (scenario 2) and $N = 30$ (scenario 3). For each scenario, average values of
$r_T$, $r_F$, $n_{it}$ and $t_{cpu}$ are computed from 1000 MC runs. Results are reported in table 1
for each algorithm. In order to have a more precise idea of the convergence rate of
the algorithms, we show in figure 1 the distribution of $r_F$ in the ranges $[10^{-4};10^{-3}[$,
$[10^{-3};10^{-2}[$, $[10^{-2};10^{-1}[$ and $[10^{-1};1[$. Convergence problems of ALS clearly appear
from these results. Whatever the considered scenario, the average values of $r_T$ and
$r_F$ remain high for both ALS-1 and ALS-2. Figure 1 shows that ALS behavior is
binary. For instance in the first scenario ($N = 5$), less than 60% of the values of $r_F$ fall
in the range $[10^{-4};10^{-3}[$ and all the other values are greater than $10^{-1}$. Moreover, the
proportion of $r_F$ values below $10^{-1}$ dramatically decreases with $N$: 25% for $N = 15$
and 7% for $N = 30$ with ALS-2. In these conditions, cputimes of ALS-1 are very low
and compete with those of DIAG, but considering the previous observation about
the convergence rates, these values are misleading. Indeed, in practice ALS should
be run from different starting points in order to obtain satisfactory convergence, hence
increasing the total cputime. Furthermore, comparing ALS-1 and ALS-2 results, it
appears that decreasing the threshold of the stopping criterion had little impact on
the convergence. Conversely, DIAG-SJDTE and DIAG-SJDTE-eco offer good results
in terms of average reconstruction and estimation errors. This is mainly due
to good convergence rates, as can be seen in figure 1, with a slight advantage for
DIAG-SJDTE. In addition, these performances are quite stable with respect to the
value of $N$. For instance for $N = 30$ more than 80% of $r_F$ values are still lower than
$10^{-2}$. Now, regarding the average cputime, DIAG-SJDTE-eco is much less time
consuming than DIAG-SJDTE for $N = 15$ (7 s against 39 s) and $N = 30$ (14 s against 275 s),
whereas the average number of iterations is the same for both algorithms. Considering
the small difference between both algorithms regarding the error criteria, we can thus
clearly recommend the use of DIAG-SJDTE-eco when $N$ is large.
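
The data generation of one MC run and the reconstruction error can be sketched as follows. The sketch assumes that the signal to noise ratio is defined from the squared Frobenius norms of the noiseless tensor and of the noise, which the text does not state explicitly; the function names are hypothetical and the tensor is smaller than in the paper to keep the example light.

```python
import numpy as np

def add_white_noise(tensor, snr_db, rng):
    """Add white Gaussian noise scaled so that
    20*log10(||tensor|| / ||noise||) equals snr_db (assumed SNR definition)."""
    noise = rng.standard_normal(tensor.shape)
    scale = np.linalg.norm(tensor) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return tensor + scale * noise

def reconstruction_error(T, T_hat):
    """Relative error ||T - T_hat|| / ||T|| with Frobenius norms."""
    return np.linalg.norm(T - T_hat) / np.linalg.norm(T)

rng = np.random.default_rng(0)
N = 5
F = [rng.standard_normal((20, N)) for _ in range(3)]   # 200 x N in the paper
clean = np.einsum('in,jn,kn->ijk', *F)
noisy = add_white_noise(clean, snr_db=40.0, rng=rng)

# Relative distance between the noiseless model and the noisy data.
noise_level = reconstruction_error(clean, noisy)
```

Under this definition a 40 dB signal to noise ratio corresponds to a relative distance of 0.01 between the noiseless model and the data, which is the order of magnitude of the reconstruction errors reported for DIAG-SJDTE in table 1.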


5 Conclusion 


We have proposed an original JEVD algorithm and shown how it can help in computing
the canonical polyadic decomposition of large tensors. The preliminary results
shown in this work indicate that this approach provides very good convergence
rates compared with a reference CPD algorithm. Moreover, it converges in very few
iterations and the computing times are very low, including for high rank CPD. Further
studies will be conducted to refine this conclusion. In particular, we now want
to evaluate the impact of the choice of the subset of matrices $\mathbf{M}^{(k)}$ and of the JEVD
algorithm inside the DIAG procedure.


On the use of Google Trend data as covariates in
nowcasting: Sampling and modeling issues


L’utilizzo dei dati Google Trend come covariate per il nowcasting: problemi di
campionamento e modelizzazione


M. Simona Andreano, Roberto Benedetti, Paolo Postiglione, Giovanni Savio


Abstract The use of Big data, and more specifically of Google Trend data, in nowcasting
and forecasting has become common practice, even among the Institutes and Organizations
in charge of producing official statistics around the world. However, such data
have many implications for model estimation, which can significantly affect the final
results. In this paper, starting from a MIDAS-AR model with a Google Trend
covariate, we focus on the main issues concerning the sampling error and the
time domain context.


Abstract The use of Big data, and more specifically of Google Trend data, has become
common practice in forecasting and nowcasting, even for official national institutes.
However, the use of such data raises several issues in model estimation, which can
have significant repercussions on the final result. In this work, starting from the
estimation of a MIDAS-AR model with a Google Trend covariate, we address the main
issues concerning the sampling error and the specific features of the time series domain.

Key words: Google trend, MIDAS model, repeated survey.


1 Introduction 


The emergence of Big Data, and their capacity to help in nowcasting, forecasting,
disaggregating, and filling in gaps in conventional statistical data sources, is now
history. However, the use of Big Data in economic nowcasting and forecasting is
still full of open questions, which concern, among others:

(a) the representativeness of Internet data sources;

(b) the synthesis of the information contained in the data;

(c) the presence of non-stationarity and seasonality;

(d) the estimation methods and modelling for disaggregating purposes;

(e) the evaluation of the nowcasting or forecasting model.


In the present paper we focus on Mixed Data Sampling (MIDAS) models
with Google Trend as covariate, used to nowcast and forecast a target variable $y_t$ $h$ steps
ahead, where the lower frequency series $y_t$ is regressed on the higher frequency one
through a distributed lag operator:

$$y_{t+h} = \beta_0 + \beta_1 B(L^{1/m};\theta)\, x^{(m)}_{t+w} + \varepsilon^{(m)}_{t+h},$$

where $B(L^{1/m};\theta) = \sum_{k=0}^{K} B(k;\theta) L^{k/m}$ denotes a weighting function, $t$ indexes the basic time
unit, $m$ is the frequency mixture and $w$ is the number of values of the indicators that
are available earlier than the lower-frequency variable to be estimated.
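
The role of the weighting function can be illustrated with a short sketch. The paper does not specify the form of the weights; the example below assumes the exponential Almon polynomial, a common choice in the MIDAS literature, and the function names, the parameter values and the toy series are purely illustrative.

```python
import math

def exp_almon_weights(theta1, theta2, n_lags):
    """Normalised exponential Almon weights for lags k = 0, ..., n_lags - 1
    (an assumed parametrisation of the weighting function)."""
    raw = [math.exp(theta1 * k + theta2 * k * k) for k in range(n_lags)]
    total = sum(raw)
    return [r / total for r in raw]

def midas_regressor(x_high, m, weights, w=0):
    """Weighted high-frequency regressor for each low-frequency period t.

    The most recent value used for period t is x_high[(t + 1) * m - 1 + w],
    where w counts the high-frequency values known ahead of the target.
    Periods without enough data are returned as None.
    """
    out = []
    for t in range(len(x_high) // m):
        last = (t + 1) * m - 1 + w
        if last - (len(weights) - 1) < 0 or last >= len(x_high):
            out.append(None)
            continue
        out.append(sum(b * x_high[last - k] for k, b in enumerate(weights)))
    return out

# Toy weekly index aggregated to a "monthly" regressor with m = 4.
weekly = [float(v) for v in range(8)]
flat = exp_almon_weights(0.0, 0.0, 4)          # theta = 0 gives equal weights
decaying = exp_almon_weights(-0.5, 0.0, 4)     # more weight on recent weeks
monthly_flat = midas_regressor(weekly, 4, flat)
monthly_lead = midas_regressor(weekly, 4, flat, w=1)
```

With equal weights the regressor reduces to a simple average of the weeks of each month; other values of the parameters let the data decide how quickly the influence of older weeks decays.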

In the next Section we summarize the main issues arising from the estimation
of such a model, with the purpose of highlighting the open questions raised by the
use of Big Data in empirical applications. Some final remarks conclude the paper.


2 Modeling and sampling issues 


Google Trend provides an index of the relative volume of search queries conducted
through Google, together with aggregated indices of search queries, which are
classified into a total of 605 categories and sub-categories using an automated
classification engine. Choi and Varian (2006, 2012) first showed the relevance of
such data in predicting consumer behavior and initial unemployment claims for the
US. In our MIDAS-AR model the covariate $x_t$ is the weekly query index of Google
Trend, and it is used to nowcast the (monthly) target variable $y_t$. For further
properties of MIDAS models we refer to Ghysels et al. (2006a, 2006b).

First of all, we have to highlight that Internet data sources, like Google Trend
data, are not a probabilistic sample, but a self-selected sample created by Internet
users. Therefore, there can be a systematic bias in the sample of Internet users, and if
we ignore these biases and assume they will be resolved through sheer sample size,
we compromise the utility of the findings of our research. For example, in
Google searches for jobs, the Internet users will probably be young people living in
big cities. These biases could be considerable when the forecasting model is applied
to obtain more granular information on the phenomena, and all aspects are
disaggregated with respect to geographical location, sex, group, sectoral activity,
etc. This typical problem of the new era of big data arises because investigators are
more likely to be separated from the data collection process and have less intimate
knowledge of the texture and the quality of the many elements in their datasets.

The use of Google trend data in a time series domain causes other, more specific
shortcomings. The target variable $y_t$ is usually observed through a panel survey,
where the same individual provides responses on repeated occasions. This is the case
for Unemployment from the Labor Force survey or for Consumption from the Household
survey, variables often forecasted through Google trend data (Askitas and
Zimmermann, 2009; D’Amuri, 2009; Fondeur and Karamé, 2012; Schmidt and
Vosen, 2010). In these cases, the sample overlap induces a correlation structure in
the sampling errors of the time series of estimates, which affects their analysis.
Estimators that ignore these correlations are generally inefficient relative to the
minimum variance linear unbiased estimator (MVLUE). Bell and Wilcox (1993)
assessed the sensitivity of parameter estimates for time series models of retail sales
data to the treatment of sampling error, through the application of ARMA models to
the sampling error, while Binder and Dick (1989) used state-space models. More
recently, Steel and McLaren (2009) examined the interaction between the design of a
repeated survey and the methods used for estimation, and reviewed the different
forms of estimators.

The Google trend covariate $x_t$ can also be seen as a time series drawn from a
repeated survey whose design is unknown. The same Internet users will
supposedly repeat their searches in the short term; therefore we have an
overlap between observations on adjacent days (weeks) that induces correlation.

Both these sampling error structures, coming from $y_t$ and $x_t$, should be properly
considered in our model estimation.

Still regarding the estimation of time series with a MIDAS model,
some words are needed on the presence of trend and seasonality in our
variables $y_t$ and $x_t$. In this regard, we outline two different issues: one arising
from the aforementioned problem of sampling error in repeated survey observations,
and the other from structural characteristics of the trend of Internet data.

In the first case, we note that the sampling errors can have important effects on
the seasonal autocorrelation properties of the observed time series, and if we do not
remove the survey error component, we may end up with spurious correlations as
part of our time series estimates, affecting the trend and seasonal components (Bell
and Wilcox, 1993). Ignoring (seasonally correlated) sampling error can make
seasonality appear much more variable than it does when the sampling error is
accommodated in the model. The application of different seasonal adjustment
procedures (X11-ARIMA or TRAMO-SEATS) will affect the final estimates of the
model in different ways, with a lower bias when the model based procedure is
applied. However, the identification of seasonality in the weekly Google trend data is
not immediate, and different seasonal frequencies may overlap.
Methods for producing variance estimates for seasonally adjusted and trend
estimates have been considered by Wolter and Monsour (1981) and Pfeffermann
(1994) and reviewed by Scott et al. (2005).

Secondly, we note that Google Trend data are provided as a weekly query index
that indicates the percentage deviation from the date to which the data are
normalised, and the series go back at most to 2004. However, the number of Internet
users has increased significantly since 2004, and the trend observed over this
period is likely to be affected by this tendency. Therefore, spurious trend relationships
could occur when dealing with Google Trend data. One solution may be to remove
the trend from the series and apply the MIDAS model to the stationary time series $y_t$
and $x_t$.
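
The risk of spurious trend relationships can be illustrated with simulated data. In the sketch below two series are generated independently, apart from sharing a deterministic upward trend; both are observed at the same frequency for simplicity, and first differencing is used as one possible way of removing the trend. The series, the slope and the seed are illustrative assumptions, not data from the paper.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def first_difference(series):
    """Remove a linear trend by differencing consecutive observations."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

random.seed(0)
n = 150
# Two unrelated noise series that both drift upwards over time.
y = [0.10 * t + random.gauss(0.0, 1.0) for t in range(n)]
x = [0.10 * t + random.gauss(0.0, 1.0) for t in range(n)]

corr_levels = pearson(x, y)                                   # driven by the trend
corr_diffs = pearson(first_difference(x), first_difference(y))
```

The correlation in levels is close to one although the two series are unrelated, while the correlation between the differenced series is close to zero, which is the behaviour one would want to check before interpreting a fitted relationship.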

Until now we have not dealt with the problem concerning the synthesis of the
information contained in the Internet data because, in our case, we have only one
Google trend covariate $x_t$ and the topic is too wide to be exhaustively discussed here.
We only note that, in our case too, before choosing the appropriate Google trend
covariate to insert in the model, we needed to explore many different query options,
and a large-scale data analysis should be carried out. As pointed out by Fisher et al. (2013),
most data analytics developed for standard data reduction processes may not be
directly applicable to big data, and there exist different efficient methods to solve the
dimensionality problem: sampling, data condensation, density-based approaches,
incremental learning, machine learning techniques, boosting, bagging, etc. In
addition to the issue of data size, Laney (2001) presented a well-known definition
(also called the 3Vs) to explain what “big” data is: volume, velocity, and variety.
The 3Vs imply that the data size is large, that the data are created rapidly, and that
they exist in multiple types and are captured from different sources. These three
characteristics strongly influence the choice of the appropriate reduction technique to
apply to the data.

The majority of traditional forecasting techniques that perform relatively well in
the case of standard data sets are more likely to distort the accuracy of forecasts
when applied to Big data, because of the presence of high noise in Big data series.
This suggests that there is a need for employing and evaluating forecasting
techniques that can filter the noise in Big Data and forecast the signal alone. With
Big data, there is an increased complexity in differentiating between randomness and
statistically significant outcomes, as there is an increased chance of reporting a
chance occurrence as a statistically significant outcome and misleading the
stakeholders interested in the forecast. Nonlinear and non-standard models are more
appropriate when forecasting with Big data (and Google Trend data).


3 Conclusions


In the present paper we gave an overview of the several problems arising from the
estimation of a time series model with a Google trend covariate, focussing on the main
issues concerning the sampling and the time domain context. We then noted a
set of key challenges that at present hinder and restrict the accuracy and effectiveness
of Big Data forecasts. Many questions are still open, and a more in-depth analysis of the
estimation problems is needed to avoid misleading forecast outcomes.


A spatial decomposition of the change in urban
poverty concentration


Una scomposizione spaziale della variazione nella 
concentrazione della poverta urbana 


Francesco Andreoli and Mauro Mussini 


Abstract This paper explores the change in the concentration of poor individuals in
the neighborhoods of a city, taking into account neighborhood locations on the urban
map. Urban poverty concentration is measured and the change over time in urban
poverty concentration is broken down into different components. Each of these
components is further split into spatial components explaining the extent to which spatial
dependence affects the change in urban poverty concentration.

Abstract The paper investigates the change in the concentration of the poor across the
neighborhoods of a city, taking into account the positions of the neighborhoods on the
urban map. The concentration of urban poverty is measured and its change over time is
broken down into different components. Each of these components is divided into its
spatial components, which explain how much spatial dependence affects the change in
the concentration of urban poverty.


Key words: Administrative data, Decomposition, Poverty, Spatial Inequality 


1 Introduction 


The growing availability of administrative data makes it possible to develop research on
poverty at a finer level of territorial disaggregation. When information on poverty
status (poor or non-poor) for residents of neighborhoods in a city is complemented
by spatial neighborhood information, the analysis of poverty distribution across
neighborhoods can be linked with the analysis of spatial dependence in the
distribution of poverty in the city. This paper focuses on urban poverty concentration,
which is measured by means of the Gini index. The change over time in the Gini
index of urban poverty is broken down into three components, explaining the roles of
changes in population proportions of neighborhoods, re-ranking of neighborhoods
and changes in disparities between neighborhood poverty rates. Each component
of the change in urban poverty concentration is further split into spatial
components, separating the contribution of changes occurring between neighboring
neighborhoods from that of changes occurring between non-neighboring neighborhoods.
This decomposition over time and space is used to analyze the change in urban
poverty concentration across the census tracts in the City of Los Angeles.


2 Concentrated poverty 


The spatial distribution of poor people within urban space is considered, with urban
space partitioned into n administrative units, referred to as neighborhoods. A city is
hence partitioned into n neighborhoods. For instance, neighborhoods may be defined at
the census tract level, although the setting can be extended to other geographic levels.
Every individual living in a neighborhood is assigned a poverty status (poor or non-poor)
according to whether his or her income is below or above a poverty line. In this framework,
an urban poverty configuration is a collection of counts of residents and poor
residents across the city neighborhoods.

The concept of urban poverty concentration is here linked with the fact that poor 
residents tend to be distributed disproportionately across neighborhoods. It is a rel- 
ative concept that can be expressed by comparing the distribution of poor popu- 
lation shares across neighborhoods with the distribution of population proportions 
across the same neighborhoods. Hence, one does not value the fact that in one urban 
poverty configuration there are more poor individuals than in another, but rather that 
the proportion of poor individuals in the population is larger and less evenly spread 
out across neighborhoods in one configuration compared to another. To measure the 
degree of concentration of poor individuals across neighborhoods, the Gini index is 
used. The Gini index of urban poverty is expressed by applying the matrix formula- 
tion of the Gini index suggested by Mussini and Grossi [2] and further developed by 
Mussini [1], as this matrix expression is useful to decompose the change over time 
in the index and to measure the spatial components of this change. 

Let $\mathbf{p} = (p_1, \ldots, p_n)^T$ be the $n \times 1$ vector of neighborhood poverty rates sorted in
decreasing order and $\mathbf{s} = (s_1, \ldots, s_n)^T$ be the $n \times 1$ vector of the corresponding
population shares. With $\mathbf{1}_n$ being the $n \times 1$ vector with each element equal to 1, $\mathbf{P}$ is the $n \times n$
skew-symmetric matrix:

$$\mathbf{P} = \frac{1}{\bar{p}} \left( \mathbf{1}_n \mathbf{p}^T - \mathbf{p} \mathbf{1}_n^T \right), \tag{1}$$

where $\bar{p}$ is the overall poverty rate in the city. The elements of $\mathbf{P}$ are the $n^2$ relative
pairwise differences between the neighborhood poverty rates as ordered in $\mathbf{p}$. Let
$\mathbf{S} = \mathrm{diag}\{\mathbf{s}\}$ be the $n \times n$ diagonal matrix with diagonal elements equal to the
population shares in $\mathbf{s}$, and $\mathbf{G}$ be an $n \times n$ G-matrix (a skew-symmetric matrix whose
diagonal elements are equal to 0, with upper diagonal elements equal to $-1$ and lower
diagonal elements equal to 1) [4]. The Gini index of urban poverty is expressed in
matrix form:

$$G(\mathbf{s}, \mathbf{p}) = \frac{1}{2} \operatorname{tr}\left( \tilde{\mathbf{G}} \mathbf{P}^T \right), \tag{2}$$

where the matrix $\tilde{\mathbf{G}} = \mathbf{S}\mathbf{G}\mathbf{S}$ is the weighting G-matrix, a generalization of the
G-matrix introduced by Mussini and Grossi [2] to add weights in the calculation of the
Gini index.
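The matrix expression of the Gini index can be checked numerically on a toy city. The sketch below is only an illustration, not the authors' code: it assumes that P has (i, j)-th element (p_j - p_i) divided by the overall poverty rate, and it compares the matrix form with the usual pairwise formula. The function names and the three-neighborhood example are hypothetical.

```python
def gini_matrix_form(shares, rates):
    """Gini index of urban poverty computed as (1/2) tr(G~ P^T)."""
    # Sort neighborhoods by decreasing poverty rate, as the text requires.
    order = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)
    p = [rates[i] for i in order]
    s = [shares[i] for i in order]
    n = len(p)
    p_bar = sum(si * pi for si, pi in zip(s, p))  # overall poverty rate

    # P[i][j] = (p_j - p_i) / p_bar: relative pairwise differences (assumed form).
    P = [[(p[j] - p[i]) / p_bar for j in range(n)] for i in range(n)]
    # G-matrix: -1 above the diagonal, +1 below it, 0 on it.
    G = [[-1.0 if i < j else (1.0 if i > j else 0.0) for j in range(n)]
         for i in range(n)]
    # Weighting G-matrix G~ = S G S with S = diag(s).
    G_tilde = [[s[i] * G[i][j] * s[j] for j in range(n)] for i in range(n)]
    # tr(G~ P^T) is the sum of the element-wise products of G~ and P.
    return 0.5 * sum(G_tilde[i][j] * P[i][j] for i in range(n) for j in range(n))


def gini_pairwise(shares, rates):
    """Reference value: weighted mean absolute difference over twice the mean."""
    p_bar = sum(s * p for s, p in zip(shares, rates))
    total = sum(si * sj * abs(pi - pj)
                for si, pi in zip(shares, rates)
                for sj, pj in zip(shares, rates))
    return total / (2.0 * p_bar)


if __name__ == "__main__":
    shares = [0.2, 0.3, 0.5]   # hypothetical population shares
    rates = [0.1, 0.3, 0.2]    # hypothetical poverty rates
    print(gini_matrix_form(shares, rates), gini_pairwise(shares, rates))
```

Both functions should agree up to floating-point error, which is a quick way to confirm the sign conventions of G and P.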


3 Decomposing changes in urban poverty concentration 


Suppose that poverty rates and population shares of $n$ neighborhoods are observed at
times $t$ and $t+1$. Let $\mathbf{p}_t$ be the $n \times 1$ vector of the $t$ poverty rates sorted in decreasing
order and $\mathbf{s}_t$ be the $n \times 1$ vector of the corresponding population shares. Let $\mathbf{p}_{t+1}$ be
the $n \times 1$ vector of the $t+1$ poverty rates sorted in decreasing order and $\mathbf{s}_{t+1}$ be the
$n \times 1$ vector of the corresponding population shares. The change in urban poverty
concentration from $t$ to $t+1$ is measured by the difference between the Gini index
in $t+1$ and the Gini index in $t$:

$$\Delta G = G(\mathbf{s}_{t+1}, \mathbf{p}_{t+1}) - G(\mathbf{s}_t, \mathbf{p}_t) = \frac{1}{2} \operatorname{tr}\left( \tilde{\mathbf{G}}_{t+1} \mathbf{P}_{t+1}^T \right) - \frac{1}{2} \operatorname{tr}\left( \tilde{\mathbf{G}}_t \mathbf{P}_t^T \right). \tag{3}$$


Equation 3 can be broken down into three components explaining the roles of
changes in population shares, ranking of neighborhoods and disparity of poverty
rates. Let $\mathbf{p}_{t+1|t}$ be the $n \times 1$ vector of $t+1$ neighborhood poverty rates sorted
in decreasing order of the respective $t$ neighborhood poverty rates, and $\mathbf{B}$ be the
$n \times n$ permutation matrix re-arranging the elements of $\mathbf{p}_{t+1}$ to obtain $\mathbf{p}_{t+1|t}$, that is
$\mathbf{p}_{t+1|t} = \mathbf{B}\mathbf{p}_{t+1}$. Let $\lambda = \bar{p}_{t+1} / \bar{p}_{t+1|t}$ be the ratio of the actual $t+1$ overall poverty
rate to the fictitious $t+1$ overall poverty rate, which is the weighted average of $t+1$
poverty rates where the weights are the corresponding population shares in $t$. Matrix
$\mathbf{P}_{t+1|t} = (1/\bar{p}_{t+1}) \left( \mathbf{1}_n \mathbf{p}_{t+1|t}^T - \mathbf{p}_{t+1|t} \mathbf{1}_n^T \right)$ contains the $n^2$ relative pairwise differences
between the neighborhood poverty rates as arranged in $\mathbf{p}_{t+1|t}$. Applying the
Mussini and Grossi decomposition [2], the change in urban poverty concentration
between $t$ and $t+1$ is split into three components:

$$\Delta G = \frac{1}{2} \operatorname{tr}\left( \mathbf{W} \mathbf{P}_{t+1}^T \right) + \frac{1}{2} \operatorname{tr}\left( \mathbf{R} \lambda \mathbf{P}_{t+1|t}^T \right) - \frac{1}{2} \operatorname{tr}\left( \tilde{\mathbf{G}}_t \mathbf{D}^T \right) = W + R - D, \tag{4}$$


where $\mathbf{W} = \tilde{\mathbf{G}}_{t+1} - \lambda \tilde{\mathbf{G}}_{t+1|t}$, $\mathbf{R} = \tilde{\mathbf{G}}_{t+1|t} - \mathbf{B}^T \tilde{\mathbf{G}}_t \mathbf{B}$ and $\mathbf{D} = \mathbf{P}_t - \mathbf{P}_{t+1|t}$. Component
$W$ measures the effect of changes in the population shares of neighborhoods. A
positive value of $W$ indicates that the weights assigned to more unequal pairs of
neighborhoods are larger in $t+1$ than in $t$, increasing urban poverty concentration
from $t$ to $t+1$. A negative value of $W$ indicates that the weights assigned to more
unequal pairs of neighborhoods are smaller in $t+1$ than in $t$, reducing urban poverty
concentration from $t$ to $t+1$. Component $R$ measures the effect of re-ranking of
neighborhoods from $t$ to $t+1$, and its contribution to the change in urban poverty
concentration is always non-negative. The nonzero elements of $\mathbf{R}$ detect the pairs
of neighborhoods which have re-ranked from $t$ to $t+1$. Component $D$ measures the
effect of disproportionate change between neighborhood poverty rates. The generic
$(i,j)$-th element of $\mathbf{D}$ compares the relative difference between the $t$ poverty rates of
the neighborhoods in positions $j$ and $i$ in $\mathbf{p}_t$ with the relative difference between the
$t+1$ poverty rates of the same two neighborhoods in $\mathbf{p}_{t+1|t}$. A positive value of $D$
means that relative disparities in poverty rates have overall decreased from $t$ to $t+1$,
reducing urban poverty concentration. A negative value of $D$ indicates that relative
disparities in poverty rates have overall increased from $t$ to $t+1$, increasing urban
poverty concentration. If all neighborhood poverty rates have changed by the same
proportion from $t$ to $t+1$, then $D = 0$.
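The three components can be illustrated with scalar quantities instead of traces. The sketch below is an interpretation based on the verbal descriptions, not the authors' implementation: it assumes that W compares the t+1 Gini index computed with t+1 shares and with t shares, that R is the gap between the Gini index and the concentration coefficient obtained when the t+1 rates keep the t ranking, and that D is the gap between the t Gini index and that same concentration coefficient. All function names and data are hypothetical.

```python
def concentration(shares, rates, rank_by):
    """Concentration coefficient of `rates`, weighted by `shares`, with
    neighborhoods ordered by decreasing `rank_by`.
    With rank_by equal to rates this is the Gini index."""
    order = sorted(range(len(rates)), key=lambda i: rank_by[i], reverse=True)
    s = [shares[i] for i in order]
    p = [rates[i] for i in order]
    p_bar = sum(si * pi for si, pi in zip(s, p))
    total = 0.0
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            # Signed difference: negative when a pair is out of rank order.
            total += s[i] * s[j] * (p[i] - p[j])
    return total / p_bar


def decompose_change(s_t, p_t, s_t1, p_t1):
    """Split the change in the Gini index into W, R and D (assumed definitions).
    All inputs are indexed by the same neighborhoods in both periods."""
    g_t = concentration(s_t, p_t, p_t)              # Gini index at t
    g_t1 = concentration(s_t1, p_t1, p_t1)          # Gini index at t+1
    g_t1_old_shares = concentration(s_t, p_t1, p_t1)  # t+1 rates, t shares
    c_t1_old_rank = concentration(s_t, p_t1, p_t)     # t+1 rates, t shares, t ranking
    w = g_t1 - g_t1_old_shares       # effect of changes in population shares
    r = g_t1_old_shares - c_t1_old_rank  # re-ranking effect, never negative
    d = g_t - c_t1_old_rank          # positive when disparities shrink
    return {"dG": g_t1 - g_t, "W": w, "R": r, "D": d}


if __name__ == "__main__":
    out = decompose_change([0.3, 0.3, 0.4], [0.30, 0.20, 0.10],
                           [0.25, 0.35, 0.40], [0.15, 0.25, 0.10])
    print(out, out["W"] + out["R"] - out["D"])
```

By construction the three terms telescope, so W + R - D reproduces the change in the Gini index, and a proportional change in all poverty rates with unchanged shares gives D = 0, as stated in the text.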


3.1 The spatial components of urban poverty concentration 


The components of the change in urban poverty concentration quantify the impacts
of different distributional changes on urban poverty; however, they do not explain
the extent to which these changes have occurred between neighborhoods that are
geographically close or not. The spatial location of neighborhoods is neglected by
$\Delta G$, $W$, $R$ and $D$, as they would remain the same if neighborhoods exchanged their
positions on the urban map. The spatial components of $\Delta G$, $W$, $R$ and $D$ can be separated
by using the approach suggested by Rey and Smith [3] to decompose the Gini
index into a neighbor component of inequality and a non-neighbor component of
inequality.

Let $\mathbf{N}_t$ be the $n \times n$ binary spatial weights matrix having its $(i,j)$-th entry equal to 1
if and only if the $(i,j)$-th element of $\mathbf{P}_t$ is the relative difference between the poverty
rates of two neighboring neighborhoods, otherwise the $(i,j)$-th element of $\mathbf{N}_t$ is 0.
Using the Hadamard product,¹ the relative pairwise differences between the poverty
rates of neighboring neighborhoods can be selected from $\mathbf{P}_t$:

$$\mathbf{P}_{N,t} = \mathbf{N}_t \odot \mathbf{P}_t. \tag{5}$$

For each pair of neighborhoods, the relative difference between the $t+1$ poverty
rates of the two neighborhoods in $\mathbf{P}_{t+1|t}$ has the same position as the relative difference
between their $t$ poverty rates in $\mathbf{P}_t$. Thus, $\mathbf{N}_t$ also selects the relative pairwise
differences between neighboring neighborhoods from $\mathbf{P}_{t+1|t}$:

$$\mathbf{P}_{N,t+1|t} = \mathbf{N}_t \odot \mathbf{P}_{t+1|t}. \tag{6}$$

1 Let $\mathbf{X}$ and $\mathbf{Y}$ be $k \times k$ matrices. The Hadamard product $\mathbf{X} \odot \mathbf{Y}$ is defined as the $k \times k$ matrix with
the $(i,j)$-th element equal to $x_{ij} y_{ij}$.


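Since $\mathbf{D} = \mathbf{P}_t - \mathbf{P}_{t+1|t}$, the Hadamard product between $\mathbf{N}_t$ and $\mathbf{D}$ produces the matrix
with nonzero elements equal to the elements of $\mathbf{D}$ pertaining to neighboring
neighborhoods:

$$\mathbf{D}_N = \mathbf{P}_{N,t} - \mathbf{P}_{N,t+1|t} = \mathbf{N}_t \odot \left( \mathbf{P}_t - \mathbf{P}_{t+1|t} \right) = \mathbf{N}_t \odot \mathbf{D}. \tag{7}$$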


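$\mathbf{P}_{N,t+1}$ being the $n \times n$ matrix whose nonzero elements are the relative pairwise
differences between the $t+1$ poverty rates of neighboring neighborhoods, the decomposition
of the change in the neighbor component of urban poverty concentration is
obtained by replacing $\mathbf{P}_{t+1}$, $\mathbf{P}_{t+1|t}$ and $\mathbf{D}$ in equation 4 with $\mathbf{P}_{N,t+1}$, $\mathbf{P}_{N,t+1|t}$ and $\mathbf{D}_N$ respectively:

$$\Delta G_N = \frac{1}{2} \operatorname{tr}\left( \mathbf{W} \mathbf{P}_{N,t+1}^T \right) + \frac{1}{2} \operatorname{tr}\left( \mathbf{R} \lambda \mathbf{P}_{N,t+1|t}^T \right) - \frac{1}{2} \operatorname{tr}\left( \tilde{\mathbf{G}}_t \mathbf{D}_N^T \right) = W_N + R_N - D_N. \tag{8}$$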


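$\mathbf{P}_{nN,t+1}$ and $\mathbf{D}_{nN}$ being the matrices with the relative pairwise differences between
non-neighboring neighborhoods, the decomposition of the change in the
non-neighbor component of urban poverty concentration is obtained by replacing $\mathbf{P}_{t+1}$
and $\mathbf{D}$ in equation 4 with $\mathbf{P}_{nN,t+1}$ and $\mathbf{D}_{nN}$ respectively:

$$\Delta G_{nN} = \frac{1}{2} \operatorname{tr}\left( \mathbf{W} \mathbf{P}_{nN,t+1}^T \right) + \frac{1}{2} \operatorname{tr}\left( \mathbf{R} \lambda \mathbf{P}_{nN,t+1|t}^T \right) - \frac{1}{2} \operatorname{tr}\left( \tilde{\mathbf{G}}_t \mathbf{D}_{nN}^T \right) = W_{nN} + R_{nN} - D_{nN}. \tag{9}$$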


Given equations 8 and 9, the decomposition over time and space is

$$\Delta G = \Delta G_N + \Delta G_{nN} = W_N + W_{nN} + R_N + R_{nN} - \left( D_N + D_{nN} \right) = W + R - D. \tag{10}$$
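The neighbor and non-neighbor split can be illustrated on the level of the Gini index itself. The sketch below is an illustration under stated assumptions rather than the authors' code: the binary matrix N is built from a hypothetical adjacency matrix, re-ordered like the sorted poverty rates, and applied to P through the Hadamard product; the non-neighbor part is taken as the remaining elements of P. The toy city has three neighborhoods arranged in a line.

```python
def hadamard(x, y):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


def half_trace(g, p):
    """(1/2) tr(g p^T), i.e. half the sum of element-wise products."""
    return 0.5 * sum(a * b for row_g, row_p in zip(g, p)
                     for a, b in zip(row_g, row_p))


def spatial_gini_components(shares, rates, adjacency):
    """Return (neighbor, non-neighbor) components of the Gini index."""
    order = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)
    s = [shares[i] for i in order]
    p = [rates[i] for i in order]
    n = len(p)
    p_bar = sum(si * pi for si, pi in zip(s, p))
    P = [[(p[j] - p[i]) / p_bar for j in range(n)] for i in range(n)]
    G_tilde = [[s[i] * s[j] * (-1.0 if i < j else (1.0 if i > j else 0.0))
                for j in range(n)] for i in range(n)]
    # Spatial weights re-arranged in the same order as the sorted rates.
    N = [[1.0 if adjacency[order[i]][order[j]] else 0.0 for j in range(n)]
         for i in range(n)]
    P_N = hadamard(N, P)                      # differences between neighbors
    P_nN = [[P[i][j] - P_N[i][j] for j in range(n)] for i in range(n)]
    return half_trace(G_tilde, P_N), half_trace(G_tilde, P_nN)


if __name__ == "__main__":
    adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # hypothetical contiguity
    g_n, g_nn = spatial_gini_components([0.2, 0.3, 0.5], [0.1, 0.3, 0.2], adjacency)
    print(g_n, g_nn, g_n + g_nn)
```

Because every off-diagonal pair is either neighboring or not, the two components add up to the overall Gini index, which mirrors the additive structure of equation 10.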


4 Application 


The decomposition is used to analyze the change in urban poverty concentration in
the City of Los Angeles from 1980 to 2014. The administrative units are the census
tracts [5]. For each census tract, the poverty line is known in both 1980 and 2014. To
check for spatial autocorrelation in the poverty distribution across census tracts, the Rey
and Smith test based on random permutations is applied [3]. The hypothesis of randomness
in the poverty distribution is rejected in both 1980 and 2014.² Table 1 shows
the spatial decomposition of each component of the change over time in the Gini
index of urban poverty. Most of urban poverty concentration is explained by the
disparities between poverty rates of non-neighboring census tracts in both 1980 and
2014, as the non-neighbor component of the Gini index of urban poverty exceeds
the neighbor component. Urban poverty concentration decreases from 1980 to 2014.
The decrease in disparities between poverty rates (D = 0.11352) plays a major role in
reducing urban poverty concentration; however, its equalizing effect is partially offset
by the impact of re-ranking (R = 0.08257). The change in the relative frequency distribution
of population across census tracts also reduces urban poverty concentration, but
it plays only a minor role (W = -0.00459).

2 The pseudo p-value obtained from 99 permutations is equal to 0.01 in both 1980 and 2014.

Table 1: Decomposition over time and space, Los Angeles, 1980-2014.

component   G_2014    G_1980    ΔG         W          R         D
N           0.10111   0.10991   -0.00880   -0.00145   0.02405   0.03140
nN          0.27417   0.30090   -0.02674   -0.00314   0.05852   0.08212
total       0.37527   0.41082   -0.03554   -0.00459   0.08257   0.11352
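The additive structure of Table 1 can be verified with a few lines of arithmetic. The sketch below simply re-uses the published figures and checks, row by row, that the change in the index equals both the difference between the two Gini values and W + R - D, and that the neighbor and non-neighbor rows add up to the total. The tolerance of 2e-5 is an assumption that allows for rounding to five decimals.

```python
# Rows of Table 1: (G_2014, G_1980, dG, W, R, D).
ROWS = {
    "N": (0.10111, 0.10991, -0.00880, -0.00145, 0.02405, 0.03140),
    "nN": (0.27417, 0.30090, -0.02674, -0.00314, 0.05852, 0.08212),
    "total": (0.37527, 0.41082, -0.03554, -0.00459, 0.08257, 0.11352),
}
TOL = 2e-5  # assumed allowance for rounding in the published table


def row_residuals(row):
    """Return the two residuals of a row: dG - (G_2014 - G_1980) and dG - (W + R - D)."""
    g14, g80, dg, w, r, d = row
    return dg - (g14 - g80), dg - (w + r - d)


def column_residuals(rows):
    """Return, for each column, total minus (neighbor + non-neighbor)."""
    return [t - (a + b) for a, b, t in zip(rows["N"], rows["nN"], rows["total"])]


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name, row_residuals(row))
    print(column_residuals(ROWS))
```

All residuals stay within rounding error, so the figures in the table are mutually consistent with equations 4 and 10.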


5 Conclusion 


A decomposition of the change over time in urban poverty concentration is shown. 
The decomposition links inequality in poverty distribution across city neighbor- 
hoods with the spatial dependence in poverty distribution. The decomposition ex- 
plains the roles of changes in population distribution across neighborhoods, re- 
ranking of neighborhoods and changes in disparities between neighborhood poverty 
rates. Each component of the change in urban poverty concentration is broken down 
into spatial components.


How green advertising can impact on gender different
approach towards sustainability


L’impatto della “pubblicità verde” sul diverso approccio
di genere alla sostenibilità.


Margaret Antonicelli, Vito Flavio Covella 


Abstract 
In recent decades, consumers' concern for the protection of the environment has
increased considerably. Initially, people were interested in learning about the main
environmental problems; now consumers have started to act on these concerns in their
purchase decisions.

Through a first pre-test followed by the final analysis, this study uses a probit model
to analyse both the statistical/econometric and the substantive significance of gender
differences in customer expectations, considering the “effect” of green advertising.
This model is estimated jointly with an ordered probit model that analyses the
magnitude of this gender difference. Results of the joint estimation and of the
conventional single-equation ordered probit model are presented for comparison.


Abstract 
Negli ultimi decenni, le preoccupazioni in materia di tutela dell'ambiente sono 
notevolmente aumentate tra i consumatori. Inizialmente, le persone erano 


interessate a scoprire i principali problemi ambientali, ma, in realtà, i consumatori 
hanno iniziato ad esercitare il loro processo decisionale per l'acquisto di prodotti. 
Effettuando prima un pre test e poi l’analisi finale, il modello probit utilizzato in 
questo studio analizza sia da un punto di vista statistico/econometrico che 
sostanziale le differenze di genere nelle aspettative dei clienti, considerando l’effetto 
della “una pubblicità verde”. Il modello probit ordinato sottolinea la grandezza di 
questo diverso approccio di genere. I risultati della stima congiunta e della singola 
equazione del modello probit convenzionale sono risultati fondamentali per 
effettuare il confronto. 

Key words: Green advertising, gender difference, probit model, econometric approach


1 Introduction


Although much has been written about sustainable durable goods consumption in the
last few decades, obtaining reliable information on consumer preferences for new
social/ethical and eco-labeled products can be an arduous task.

The literature on business ethics, corporate social responsibility and sustainability
includes many studies on gender differences, but the results are often
conflicting. In particular, there is not yet full agreement on the role and
significance of gender differences in customer expectations and perceptions of
responsible corporate conduct. The current study analyses both the statistical and the
substantive significance of gender differences in customer expectations and
perceptions of corporate responsibility, also examining the influence of age and
education¹. In particular, the purpose of this study is to understand how male and
female consumers evaluate sustainability claims from brands differently, and how
brands’ sustainability efforts and the presence or absence of information transparency
in the claims affect their brand schemas differently.


2 Literature review and hypothesis development 


Recently, societies have become more concerned about environmental protection². As
a result, many consumers are modifying their consumption practices, choosing
products with reduced environmental impacts³. Sustainable consumption, in this
study, refers to the purchase and use of products that have lower environmental impacts⁴
and that result in pro-social behaviours⁵. The first major phenomenon observed is a
shift in the attention of conscious consumers: whereas they were once more interested
in the sustainability of products, today the green community looks first at the
sustainable approach of companies and only later at the products.

These observations underline the importance of communication in the process of
building a sustainable reputation: for a company to be considered sustainable, it is
important that "the company communicates in a transparent way to the consumer"⁶.
This reputation is fundamental when consumers choose which product to buy.

Communication and reputation thus play a primary role in sustainability.
According to past research, sustainable consumers are mainly female, aged between
30 and 44 years old, well educated, and living in a household with a high annual income.


1 Calabrese A., Costa R., Rosati F., Gender differences in customer expectations and perceptions of
corporate social responsibility

2 Corraliza and Berenguer, 2000

3 Schaefer and Crane, 2005

4 Follows and Jobber, 2000; Pedersen, 2000; Gordon et al., 2011

5 Diego Costa Pinto, Marcia M. Herter, Patricia Rossi, Adilson Borges, 2014

6 Roveda, 2014

Female participants are more likely to engage in sustainable consumption because
they hold stronger attitudes towards the environment than male participants⁷. In
addition, women tend to be more socially responsible, environmentally concerned
and ecologically conscious than men, and also tend to consider more carefully the
impacts that their consumption may have on others⁸. Female participants are
also more willing than male participants to change their lifestyle in order to reduce
the negative environmental impacts of consumption⁹, and are willing to buy
and to pay more for an environmentally friendly product.
Furthermore, studies have shown that the adoption of sustainable practices may depend
on reasons beyond conservation of the environment¹⁰. In this work, we thus
developed the following hypotheses: high propensity and high knowledge for men to
buy sustainable durable goods, and extensibility of the results obtained in the pre-test
to the entire sample.


3. Methodology 


3.1 Data 


The present study analyses both the econometric and the substantive significance of
gender differences in customer expectations, as well as in the perception of corporate
responsibility, additionally examining the influence of age and education. The
work, based on the Italian macro-context, explains sustainability discourses in
advertising. The aim is to define sustainability broadly and to explain the issue of
inequality, particularly gender inequality, as originating in various forms of
ascendancy over nature. More specifically, a quantitative survey was drawn up on
the attitude of Italian citizens towards production systems. The cross-sectional survey
was administered to a sample of no fewer than 1200 units across the entire Italian
territory. The questionnaire was pre-tested to
reduce errors arising from possible misinterpretation. Regarding the models used, this
work is based on a comparison of three different models: the probit and ordered probit
models, estimated jointly, and the single-equation model.


7 Diamantopoulos et al., 2003; Jain and Kaur, 2006

8 Roberts, 1996b; Mainieri et al., 1997; Straughan and Roberts, 1999; Noble et al., 2006

9 Abeliotis et al., 2010

10 Diego Costa Pinto, Marcia M. Herter, Patricia Rossi, Adilson Borges, 2014


3.2 Empirical results


Previous studies have identified a variety of demographic and attitudinal characteristics
that may affect consumer propensity to buy sustainable durable goods. For empirical
implementation, the explanatory variables of the equation are gender, age and education.
In addition, the equation includes the level of knowledge of and satisfaction with
sustainable durable goods, the frequency of purchase of sustainable durable goods, the
willingness to pay an additional fee for sustainable durable goods and the level of
information available on sustainability. Specifically, it is expected that women
would be more attentive to green advertising and would be more willing to buy
sustainable durable goods if sustainability were considered an important attribute
when making purchases. The maximum likelihood estimates of the generalized
binary-ordinal probit model for the entire analysis are presented in Table 1. For
comparison, results of the conventional single-equation ordered probit estimation are
also presented. It is evident from Table 1 that the single-equation model performs
poorly compared with the binary-ordinal probit model, judging from the pseudo-R2s
that were computed as a measure of goodness of fit for the estimated models.

In general, regarding the hypotheses, after the extensibility of the pre-test was confirmed,
results showed the main effects of gender and identity on sustainable consumption.
In particular, this research suggests that female participants have higher levels of
sustainable consumption than male participants. This evidence is confirmed in all
three models by the “gender” variable. The results provide further support for past
research¹¹, suggesting that female participants are likely to engage more in
sustainable consumption than male participants. It is also important to emphasize
that not only is the second hypothesis verified, but in the second model the
great majority of the independent variables are also highly significant.
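The text does not report the model equations, so the following is only a generic sketch of how an ordered probit model turns a linear index and a set of thresholds into category probabilities. The thresholds, the index values and the function names are hypothetical and are not the estimates of Table 1; the joint binary-ordinal specification used in the study is not reproduced here.

```python
import math
from statistics import NormalDist


def ordered_probit_probs(xb, thresholds):
    """Category probabilities of a generic ordered probit model.

    xb         -- linear index x'beta of one respondent (hypothetical value)
    thresholds -- increasing cut points mu_1 < mu_2 < ... (hypothetical values)
    """
    cdf = NormalDist().cdf
    # Cumulative probabilities P(y <= k) = Phi(mu_k - x'beta); the last one is 1.
    cumulative = [cdf(mu - xb) for mu in thresholds] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


def log_likelihood(indices, outcomes, thresholds):
    """Sum of log-probabilities of the observed categories (coded 0, 1, 2, ...)."""
    return sum(math.log(ordered_probit_probs(xb, thresholds)[y])
               for xb, y in zip(indices, outcomes))


if __name__ == "__main__":
    print(ordered_probit_probs(0.4, [0.0, 1.2]))
    print(log_likelihood([0.4, -0.3, 1.1], [1, 0, 2], [0.0, 1.2]))
```

A larger linear index shifts probability mass towards the higher categories, which is how a positive coefficient on a variable such as gender would be read in this kind of model.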


11 Roberts, 1996a


Table 1: Results of the joint estimation of the probit and ordered probit models and of the single-equation model

                                          Joint estimation                 Single-equation estimation
Variable                                  Probit          Ordered probit   Ordered probit
Constant                                  1.71845 ***     2.27266 ***      2.060921 **
                                          (-3.801)        -1.6015          (-1.349)
Gender                                    0.01009 +       0.14198 ***      0.19281 **
                                          (-1.099)        -0.0465          -0.2014
Age                                       -0.00593        -0.01102 ***     -0.01382 **
                                          (-6.012)        (-0.4491)        (-0.1408)
Education                                 -0.07423 *      0.07281 ***      0.01545
                                          (-14.719)       -0.2493          (0.3556)
Level of knowledge about                  0.03162 **      20.31418 ***     -0.40419 *
sustainability durable goods              (-0.170)        (-0.0728)        (-1.3193)
Willingness to pay an additional          -0.05559 sa     -0.06562 **      -0.07148 **
fee to sustainable durable goods          (-1.536)        (-1.8640)        (-0.5389)
Frequency of purchase of                  0.41152         0.17971 *        0.08281
sustainable durable goods                 -12.126         -0.9323          -0.1347
Level of satisfaction in knowing that     -0.19884 +      0.1857 ae        0.11809 *
the property purchased is sustainable     (-3.586)        -1.3396          -0.1984
Level of information available            -0.53425 *      0.51809 xk       0.44372 **
on the sustainability                     (-4.412)        -1.1324          -1.0803
μ1                                                        1.303 a          0.812 *
                                                          -0.5483          (9.821)
μ2                                                        1.91
                                                          -17.787
6                                                         0.147
                                                          -3.522
Log likelihood                            -407.065                         -421.681
Pseudo R2                                 0.339                            -0.245
Sample                                    1200            1136             1200

Numbers in parentheses are t-ratios.

* indicates statistical significance at the 0.10 level, ** at the 0.005 level and *** at the 0.001 level.


Discussion 


Despite the large literature on sustainability, it is clear that a higher level of
information is a key factor for positive acceptance. This means that sustainability,
combined with different communication policies, is able to create a good image and
to modify consumer behaviour¹². This study has one important limitation that can guide
future studies: the sample consisted only of Italian participants with access to the Internet.
Although such a sample may have biased the results, it is important to note that, in
Italy, the Internet penetration rate is 87.1% (Istat, 2016). Moreover, this could explain
the predominance of young participants in the sample. Future research could use
another method of data collection in order to cover a different, wider age
range. This would make it possible to investigate whether gender and identity effects
change across age groups. In addition, convenience sampling is a limitation of
the study. Future studies could use representative samples to investigate the effects
of gender and identities on sustainable consumption.

12 Antonicelli M., Calace D., Morrone D., Russo A., Vastola V., 2015


References 




ANTONICELLI M., CALACE D., MORRONE D., RUSSO A., VASTOLA V., 2015,
Information or confusion? The role of ecolabels in agrifood sector, Analele Universității din
Oradea, Fascicula Ecotoxicologie, Zootehnie și Tehnologii de Industrie Alimentară, Vol.
XIV/A, pp. 187-195.

BERKOWITZ L., LUTTERMAN K.G., 1968, The traditional socially responsible
personality, Public Opinion Quarterly, 32, 169-185.

CALABRESE A., COSTA R., ROSATI F., 2016, Gender differences in customer expectations
and perceptions of corporate social responsibility, Journal of Cleaner Production, Vol. 116,
pp. 135-149.

COSTA PINTO D., HERTER M. M., ROSSI P., BORGES A., 2014, Going green for self or
for others? Gender and identity salience effects on sustainable consumption, International
Journal of Consumer Studies, Vol. 38, pp. 540-549.

CHERRIER H., 2006, Consumer identity and moral obligations in nonplastic bag 
consumption: a dialectical perspective, International Journal of Consumer Studies, 30, 515— 
523. 

CORRALIZA J.A., BERENGUER, J., 2000, Environmental values, beliefs, and actions: a 
situational approach, Environment and Behavior, 32, 832-848. 

HORNE R.E., 2009, Limits to labels: the role of eco-labels in the assessment of product 
sustainability and routes to sustainable consumption, International Journal of Consumer 
Studies, 33, 175-182. 

JAIN S.K., KAUR G., 2006, Role of socio-demographics in segmenting and profiling green 
consumers: an exploratory study of consumers in India, Journal of International Consumer 
Marketing, 18, 107-146. 

LUCHS M., MOORADIAN T., 2012, Sex, personality, and sustainable consumer behaviour: 
elucidating the gender effect. Journal of Consumer Policy, 35, 127-144. 

ROBERTS J., 1993, Sex differences in socially responsible consumers’ behaviour. 
Psychological Reports, 73, 139-148. 

ROBERTS J.A., 1996b, Green consumers in the 1990s: profile and implications for 
advertising, Journal of Business Research, 36, 217-231. 

SALAZAR H.A., OERLEMANS L., VAN STROE-BIEZEN S., 2013, Social influence on 
sustainable consumption: evidence from a behavioural experiment, International Journal of 
Consumer Studies, 37, 172-180. 

SCHAEFER A., CRANE A., 2005, Addressing sustainability and consumption, Journal of 
Macromarketing, 25, 76-92. 

SCHULTZ P.W., 2001, The structure of environmental concern: concern for self, other 
people, and the biosphere, Journal of Environmental Psychology, 21, 327-339. 

STOCK J. H., WATSON M. W., 2011, Introduction to Econometrics, 3/E, Pearson Higher 
Education 


Stratified data: a permutation approach for 
hypotheses testing 


Dati stratificati: un approccio di permutazione per test 
d’ipotesi 


Rosa Arboretti, Eleonora Carrozzo, Luigi Salmaso 


Abstract The present work presents a general nonparametric alternative
to the well-known van Elteren test for two-sample stratified analysis. We developed
a solution based on permutation tests that uses the Nonparametric Combination
(NPC) methodology to reduce the dimensionality of the problem. A simulation
study comparing the performance of the proposed test with that of the usual van Elteren
test and of the aligned rank test has been performed, considering both continuous and
ordinal data. Results show that the nominal α-level is respected under H0 even for small
sample sizes. A real application example is also presented.

Abstract Il presente lavoro ha l’obiettivo di proporre un'alternativa non parametrica
generale al test di van Elteren per analisi a due campioni stratificati. La
soluzione sviluppata è basata sui test di permutazione e considera la metodologia
della Combinazione Non Parametrica (NPC) per ridurre la dimensionalità del problema.
È stato eseguito uno studio di simulazione per confrontare le prestazioni del
test proposto con quelle del test di van Elteren e del test Aligned Rank, considerando
sia dati continui che ordinali. I risultati mostrano il rispetto dell’α nominale sotto
H0 anche per basse numerosità campionarie. È inoltre presentata un’applicazione
ad un caso reale.


Key words: Stratified test, Nonparametric Combination methodology, Permutation


1 Introduction


Suppose we have two treatments and are interested in detecting differences
between their effects. Suppose also that the treatment effects may be influenced by a
confounding factor, which is taken into account by stratification. In this situation,
the analysis must take the presence of these strata into account.

The literature on stratified experiments is vast and, in particular in the context of
so-called multicenter clinical trials, it revolves around the van Elteren test [10], which
is the optimal test when there is no interaction between treatment effect and strata.
However, if the treatment effect is not constant across strata, the van Elteren test can
become inefficient in detecting differences between treatments. The literature therefore
offers alternatives to the van Elteren test that present good operating characteristics [2, 4, 3,
5, 6].

Our interest in stratified tests arose while dealing with a real industrial problem, which
was also affected by a very small sample size. We therefore wondered whether existing tests
were suitable for our purpose, and we decided to provide a general solution for the
problem at hand and to compare it with existing ones.

In the present paper we describe our proposed solution for stratified problems,
where variables can be of different nature (continuous, discrete, ordinal etc.).
The proposed approach is nonparametric, based on permutation tests, and uses
the NonParametric Combination (NPC) methodology [9] as a tool to reduce the
dimensionality of the problem. This implies that there can also be more than
one stratification factor.

Section 2 presents and formalizes the problem. The idea underlying
the proposed procedure is described and, after the main notation and assumptions
have been defined, a detailed algorithm for the NPC-based procedure is provided.

In Section 3 we report the results of a simulation study comparing the
performance of our proposed method with that of the van Elteren test and of the aligned
rank test proposed in [6]. We consider small sample sizes, which are commonly of interest in
practice. We investigate the case in which the effect is constant across strata and the case
in which the effect varies across strata, both under the null hypothesis and under the
alternative.

Finally, in order to illustrate the usefulness of the proposed method in a practical
context, in Section 4 we analyze data from an industrial problem.


2 NPC-based permutation test for stratified analysis 


In the present section we describe the permutation approach proposed to deal with
stratified problems. This nonparametric solution is based on the NonParametric Combination
(NPC) methodology, which makes it possible to overcome the methodological difficulties
related to the presence of one or more stratification factors.

Let $X_{ish} \sim F(x + \gamma_s + \delta_h)$ be the response variable for the $i$-th observation in the
$s$-th stratum under treatment $h$, where $\gamma_s$ represents the location effect of stratum $s$ and $\delta_h$
is the effect of treatment $h$, $i = 1, \ldots, n_{hs}$, $s = 1, \ldots, S$, $h \in \{A, B\}$, and $S$ is the
number of strata.
We are interested in testing:

$$H_0 : \delta_A = \delta_B \quad \text{against} \quad H_1 : \delta_A > \delta_B. \tag{1}$$

Taking into account the possible effect of stratification factors, we break down the
problem into sub-problems, one for each stratum, i.e.:

$$H_{0(s)} : \delta_{A(s)} = \delta_{B(s)} \quad \text{against} \quad H_{1(s)} : \delta_{A(s)} > \delta_{B(s)}. \tag{2}$$

The idea of the NPC-based procedure is to suitably combine the p-values from each
stratum. It is important to note that the effect of treatments may be multivariate.
In order to clarify the steps of the procedure, in Section 2.1 we report the related
algorithm.


2.1 An algorithm for NPC-based stratified test 


In this section we describe the algorithm that implements the NPC-based stratified
procedure. For an overview of NPC-based testing and its properties see, for example, [1],
[7], [8], [9]. Before listing the steps of the algorithm, let us define some
notation. Given:


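$$X_h = [x_{1(h,1)}, \dots, x_{n_{h1}(h,1)}, \dots, x_{1(h,S)}, \dots, x_{n_{hS}(h,S)}],$$

a sample of size $n_h = \sum_{s=1}^{S} n_{hs}$ for treatment $h \in \{A, B\}$ from an unknown
distribution $F$, let us define the whole sub-sample $X_{(s)}$ for stratum $s$, of size $n_{As} + n_{Bs}$, and
the related permuted sub-sample $X^*_{(s)}$, $u^*$ being any random permutation of the labels
$(1, \dots, n_{As} + n_{Bs})$:

$$X_{(s)} = [x_{1(A,s)}, \dots, x_{n_{As}(A,s)}, x_{n_{As}+1(B,s)}, \dots, x_{n_{As}+n_{Bs}(B,s)}]$$ and
$$X^*_{(s)} = [x_{u^*_1}, \dots, x_{u^*_{n_{As}}}, x_{u^*_{n_{As}+1}}, \dots, x_{u^*_{n_{As}+n_{Bs}}}].$$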
These are the steps of the procedure:
1. $\forall s = 1, \dots, S$:

1.1. In order to test (2) on $X_{(s)}$, compute a suitable test statistic $T_{(s)}$, for example
the difference of means for continuous data:

$$T_{DM(s)} = \frac{1}{n_{As}} \sum_{i=1}^{n_{As}} x_{i(A,s)} - \frac{1}{n_{Bs}} \sum_{j=1}^{n_{Bs}} x_{j(B,s)}$$

or the Anderson-Darling test statistic in the case of ordinal data:

$$T_{AD(s)} = \sum_{i=1}^{v-1} N_{i(Bs)} \left[ N_{i\cdot(s)} \left( (n_{As} + n_{Bs}) - N_{i\cdot(s)} \right) \right]^{-1/2}$$

where $v$ is the number of categories, $N_{i\cdot(s)} = N_{i(As)} + N_{i(Bs)}$, in which $N_{i(As)}$
and $N_{i(Bs)}$ are the cumulative frequencies of category $i$ for stratum $s$ in
treatment groups A and B, respectively (see [9, sec. 2.8.3]);

1.2. perform a random permutation $u^*$, obtaining $X^*_{(s)}$;

1.3. compute the permuted value of the test statistic, $T^*_{(s)}$;

1.4. independently repeat steps (1.2)-(1.3) $B$ times to obtain the permutation
distribution of the test statistic $T_{(s)}$;

1.5. estimate the p-value $\lambda_s = \sum_{b=1}^{B} I(T^*_{b(s)} \ge T_{(s)}) / B$ and the related empirical
significance function $\lambda^*_{b(s)} = \left[ \tfrac{1}{2} + \sum_{r=1}^{B} I(T^*_{r(s)} \ge T^*_{b(s)}) \right] / (B + 1)$, $b = 1, \dots, B$;


2. Through a suitable combining function $\Phi(\cdot)$, combine the p-values related to
the different strata, obtaining $T'' = \Phi(\lambda_1, \dots, \lambda_S)$ and the related
distribution $T''^*_b = \Phi(\lambda^*_{b(1)}, \dots, \lambda^*_{b(S)})$, $b = 1, \dots, B$;
3. compute the combined p-value $\lambda'' = \sum_{b=1}^{B} I(T''^*_b \ge T'') / B$;
4. reject the null hypothesis in (1) if $\lambda'' \le \alpha$.
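
The steps of the procedure can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the test statistic is the difference of means, the combining function is Fisher's, $-2 \sum_s \log \lambda_s$ (the text does not say which combining function was used), and the observed partial p-values are computed with the same $(1/2 + \text{count})/(B+1)$ form as the empirical significance function so that the logarithm is always finite. The function names and defaults are hypothetical.

```python
import numpy as np


def significance(t_ref, t_dist):
    """Empirical significance (1/2 + #{T* >= t}) / (B + 1), as in step 1.5."""
    sorted_t = np.sort(t_dist)
    # searchsorted(side="left") counts the values strictly below t_ref
    n_ge = sorted_t.size - np.searchsorted(sorted_t, t_ref, side="left")
    return (0.5 + n_ge) / (sorted_t.size + 1)


def npc_stratified_test(strata, n_perm=2000, seed=0):
    """One-sided NPC test of delta_A > delta_B across strata.

    strata: list of (x_a, x_b) pairs of 1-D samples, one pair per stratum.
    Returns (global p-value, array of partial p-values).
    """
    rng = np.random.default_rng(seed)
    n_strata = len(strata)
    t_obs = np.empty(n_strata)
    t_perm = np.empty((n_perm, n_strata))
    for s, (x_a, x_b) in enumerate(strata):
        x_a = np.asarray(x_a, dtype=float)
        x_b = np.asarray(x_b, dtype=float)
        pooled = np.concatenate([x_a, x_b])
        n_a = x_a.size
        t_obs[s] = x_a.mean() - x_b.mean()            # step 1.1
        for b in range(n_perm):                        # steps 1.2-1.4
            shuffled = rng.permutation(pooled)         # permute within stratum
            t_perm[b, s] = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    # Step 1.5 (assumption: observed p-values also use (1/2 + count)/(B + 1),
    # so that log(0) never occurs in the combination below).
    lam_obs = np.array(
        [significance(t_obs[s], t_perm[:, s]) for s in range(n_strata)]
    )
    lam_perm = np.column_stack(
        [significance(t_perm[:, s], t_perm[:, s]) for s in range(n_strata)]
    )
    # Step 2, with Fisher's combining function (an assumed choice of Phi).
    comb_obs = -2.0 * np.log(lam_obs).sum()
    comb_perm = -2.0 * np.log(lam_perm).sum(axis=1)
    # Step 3: combined p-value; step 4 compares it with alpha.
    p_global = float(np.mean(comb_perm >= comb_obs))
    return p_global, lam_obs
```

Permutations are drawn independently within each stratum, so the stratum effects never mix; only the partial significance values are combined across strata.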


3 A comparative simulation study 


In the present section the operating characteristics of the procedure proposed in Section 2
are discussed and compared with those of the van Elteren test and of the aligned rank test
proposed in [6]. We performed 5000 Monte Carlo simulations, with B = 5000 permutations
for the permutation tests. We considered the following simulation settings:


Setting 1: $S = 3$; $n_{hs} = 12$, $\forall s = 1, \dots, S$ and $h \in \{A, B\}$, data generated from
$N(0 + \delta_h + \gamma_s, 1)$;

Setting 2: $S = 3$; $n_{hs} = 12$, $\forall s = 1, \dots, S$ and $h \in \{A, B\}$, data generated from an
ordinal variable with 10 categories;

Setting 3: $S = 5$; $n_{hs} = 12$, $\forall s = 1, \dots, S$ and $h \in \{A, B\}$, data generated from
$N(0 + \delta_h + \gamma_s, 1)$;

Setting 4: $S = 5$; $n_{hs} = 12$, $\forall s = 1, \dots, S$ and $h \in \{A, B\}$, data generated from an
ordinal variable with 10 categories;

where $(\delta_A, \delta_B) = (0.50, 0.00)$ and, for Setting 1, $(\gamma_1, \gamma_2, \gamma_3) = (0.25, 0.5, 0.55)$ and,
for Setting 3, $(\gamma_1, \gamma_2, \gamma_3, \gamma_4, \gamma_5) = (0.50, 0.25, 0.15, 0.05, 0.00)$.
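
As an illustration, the normal settings (1 and 3) could be generated along the following lines. This is a hedged sketch: the function name, the seed handling and the choice $\delta_A = \delta_B = 0$ under the null hypothesis are assumptions made here; the text gives only the parameter values under the alternative with constant treatment effect.

```python
import numpy as np

DELTA = {"A": 0.50, "B": 0.00}                 # (delta_A, delta_B) from the text
GAMMA = {
    1: (0.25, 0.5, 0.55),                      # Setting 1, S = 3
    3: (0.50, 0.25, 0.15, 0.05, 0.00),         # Setting 3, S = 5
}


def generate_normal_setting(setting, n_per_cell=12, under_null=False, seed=0):
    """Return one (x_a, x_b) pair per stratum, drawn from N(delta_h + gamma_s, 1)."""
    rng = np.random.default_rng(seed)
    # Assumption: under the null both treatments share delta_B = 0.
    delta_a = DELTA["B"] if under_null else DELTA["A"]
    data = []
    for gamma_s in GAMMA[setting]:
        x_a = rng.normal(delta_a + gamma_s, 1.0, n_per_cell)
        x_b = rng.normal(DELTA["B"] + gamma_s, 1.0, n_per_cell)
        data.append((x_a, x_b))
    return data
```

A rejection rate is then the share of simulated data sets for which the chosen test rejects at level $\alpha$.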

Actually, we started from sample size $n_{hs} = 5$, $\forall s = 1, \dots, S$ and $h \in \{A, B\}$, but the
aligned rank test showed an anti-conservative behaviour, so we decided to consider
a sample size at which the nominal $\alpha$ is respected by all three procedures, so that they
are comparable in power. Table 1 and Table 2 show the rejection rates of the tests
when the treatment effect is constant across strata and when it varies across strata,
respectively.

For all three testing procedures, rejection rates under the null hypothesis are close to the
nominal $\alpha$, for both normal and ordinal categorical variables. Under the alternative
hypothesis, the van Elteren test shows lower power than its
competitors, and the NPC procedure shows the highest power among the three
tests compared. We can also note that power tends to increase with the
number of strata.


Table 1 Rejection rates at significance level α = 0.05 over 5000 simulations, with treatment effect
constant across strata.

                       A = B                            A > B
               Normal   Ordinal Categorical     Normal   Ordinal Categorical
S = 3  NPC     0.046    0.048                   0.605    0.630
       Align   0.050    0.053                   0.534    0.517
       vE      0.047    0.047                   0.520    0.500
S = 5  NPC     0.049    0.049                   0.793    0.801
       Align   0.051    0.051                   0.765    0.732
       vE      0.050    0.049                   0.747    0.709

Table 2 Rejection rates at significance level α = 0.05 over 5000 simulations, with treatment effect
varying across strata.

                       A = B                            A > B
               Normal   Ordinal Categorical     Normal   Ordinal Categorical
S = 3  NPC     0.047    0.052                   0.619    0.630
       Align   0.050    0.051                   0.543    0.518
       vE      0.050    0.047                   0.523    0.508
S = 5  NPC     0.052    0.050                   0.792    0.808
       Align   0.052    0.051                   0.764    0.765
       vE      0.050    0.048                   0.746    0.740


4 An application example 


We consider an industrial problem in which a company producing bicycles has to
choose between two types of paint, say A and B. In order to
compare the quality of the competing paints, the performance of each paint on the
bicycle frame was recorded. Performance was measured by an instrument which
inspects the smoothness of the surface of the piece, giving a continuous measure from 1 to 100,
to be read as ”the larger the better”. There are 5 machines painting components, and
a specific machine could influence the performance of the paint, so we have to take this
aspect into consideration. For each paint, a sample of size $n_h = \sum_{s} n_{hs} = 25$, with
$n_{hs} = 5$ for $h \in \{A, B\}$ and $\forall s = 1, \dots, S$, $S = 5$, has been collected. Data are shown
in Table 3. After applying the NPC procedure with the difference of means as
test statistic, we obtain a global p-value of $\lambda''^{B>A} = 0.004$, indicating a significant
difference in performance between paint A and paint B, in the sense that paint B
performs better than paint A, in line with the data in Table 3. An important feature of the NPC
procedure is that it makes it possible to investigate which strata mainly affect the global result.
Note that we perform comparisons in the two directions (i.e. $\delta_A > \delta_B$ and $\delta_B > \delta_A$);
for the sake of simplicity we show here only the comparisons of interest, i.e. $\lambda_1^{B>A} = 0.004$,
$\lambda_2^{B>A} = 0.004$, $\lambda_3^{B>A} = 0.362$, $\lambda_4^{B>A} = 0.254$, $\lambda_5^{B>A} = 0.247$.
As we can see from the partial results, 2 out of 5 strata seem to mainly affect the
global result. Extension to ordinal or mixed data is straightforward.
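
A quick check helps in reading a partial p-value of 0.004. With 5 observations per paint in a stratum there are $\binom{10}{5} = 252$ distinct splits, so the smallest attainable exact permutation p-value is $1/252 \approx 0.004$. The sketch below enumerates all splits for the stratum 1 data of Table 3, in the direction B > A; it is an exact enumeration written for illustration, not the Monte Carlo procedure used in the paper, and the function name is hypothetical.

```python
from itertools import combinations

paint_a = [96.8, 96.7, 96.7, 93.2, 94.4]   # stratum 1, Table 3
paint_b = [98.6, 99.4, 99.4, 99.8, 98.4]


def exact_pvalue_b_greater(x_a, x_b):
    """Exact one-sided permutation p-value for mean(B) - mean(A)."""
    pooled = list(x_a) + list(x_b)
    n_b = len(x_b)
    observed_sum_b = sum(x_b)
    n_splits = 0
    n_extreme = 0
    # With fixed group sizes, mean(B) - mean(A) is increasing in sum(B),
    # so comparing the sums of the B group is equivalent.
    for idx in combinations(range(len(pooled)), n_b):
        n_splits += 1
        # small tolerance guards against floating-point rounding in the sums
        if sum(pooled[i] for i in idx) >= observed_sum_b - 1e-9:
            n_extreme += 1
    return n_extreme / n_splits


p_exact = exact_pvalue_b_greater(paint_a, paint_b)
print(p_exact)   # 1/252, i.e. about 0.004
```

Every value of paint B in stratum 1 exceeds every value of paint A, so the observed split is the single most extreme one and the p-value reaches its lower bound.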


Table 3 Response data for example application

Stratum   Paint A                          Paint B
1         96.8, 96.7, 96.7, 93.2, 94.4     98.6, 99.4, 99.4, 99.8, 98.4
2         85.2, 76.2, 83.1, 76.8, 76.8     95.8, 97.3, 97.9, 96, 97.6
3         100, 96.1, 100, 100, 99.4        100, 100, 100, 99.4, 99.6
4         95, 93.8, 95, 94.3, 94.3         95.1, 94, 96, 94.3, 94.6
5         92.1, 97.5, 98.2, 97.4, 97.6     94.2, 97.6, 98.2, 98.2, 98.6


5 Conclusions 


In recent years some alternatives to the van Elteren test for stratified two-sample
analysis have been proposed. In particular, the aligned rank test is a potential choice
given its good operating characteristics. In the present work we proposed a new
nonparametric NPC-based stratified test and compared its performance with that
of the van Elteren and aligned rank tests.

Among the tests compared, NPC showed the highest power, and it respects the nominal
α-level even for very small sample sizes.

Moreover, extensions to multivariate observations, to C > 2 samples and to repeated
measurement data can be obtained within our NPC-based approach and will be
considered in future research.

Crowd and Minorities: Is it possible to listen to
both? Monitoring Rare Sentiment and Opinion
Categories about Expo Milano 2015


Mass opinion and niche opinion: can we
measure both? Monitoring rare sentiment and
opinions about Expo 2015


Marika Arena, Anna Calissano and Simone Vantini 


Abstract The talk introduces a new aggregated classification scheme aimed at
supporting the implementation of text analysis methods in contexts characterised by the
presence of rare text categories. This approach starts from the aggregate supervised
text classifier developed by Hopkins and King and moves forward by relying on rare
event sampling methods. In detail, it enables the analyst to enlarge the number of
text categories whose proportions can be estimated, preserving the estimation
accuracy of standard aggregate supervised algorithms and reducing the working time
with respect to unconditionally increasing the size of the random training set. The approach is
applied to study the daily evolution of the web reputation of Expo Milano 2015
before, during and after the event. The data set consists of about 900,000 tweets
in Italian and 260,000 tweets in English, posted about the event between March
2015 and December 2015. The analysis provides an interesting portrait of the
evolution of Expo stakeholders’ opinions over time and makes it possible to identify the main drivers
of Expo reputation. The algorithm will be implemented as a running option of the
next release of the R package ReadMe.


Key words: Sentiment Analysis, Opinion Analysis, Rare Sampling Design, Expo
Milano 2015

1 Introduction


From the 1st of May 2015 to the 31st of October 2015, Milano hosted the 2015
World Exposition (Expo Milano 2015). Doubts and uncertainties characterized the
event at the beginning. The enthusiasm of hosting a world fair was accompanied
by controversies concerning its set-up; the long-lasting discussion about the
investments required for its preparation alternated with that about the opportunity of exploiting
the positive externalities induced by the event. Corruption episodes, cost overruns and
delays were often in the news. However, when the
exposition started, initial skepticism gave way to growing curiosity and, in the end,
the event turned out to be an unexpected success. Expo Milano 2015 involved 140 countries
and was visited by 21 million people, including 7 million foreign visitors and
2 million students [5]. The theme “Feeding the Planet, Energy for Life” was an
opportunity to put sustainability at the top of the political agenda
and stimulated visitors with thought-provoking ideas coming from the pavilions of
different countries. But how did the perception of Expo Milano 2015 evolve before,
during, and after the event? Why was a changing dynamic registered? In the talk
we will answer these questions by proposing a new aggregate supervised classification
scheme [2]. The data used to train and test the model and to map the
reputation precisely come from Twitter.


2 Method and Case Study 


In this talk, we will present in detail the study of the web reputation of Expo
Milano 2015, analysing Twitter data through sentiment and opinion analysis.
Soaring on-line discussion and web participation mirrored the bustle that
surrounded the Expo, making social media an interesting channel for understanding
what people were thinking and saying about it. Among the existing social
media platforms, we focus on Twitter for two reasons: it has a public philosophy and
offers a partial free download of its data via API, and it is a micro-blogging
platform, where users share their own opinions about specific
topics in 140 characters. The brevity of the posts helps the performance of sentiment analysis,
which is conducted on a sentence-level data set.


2.1 The DataSet 


The tweets with tags related to Expo Milano 2015 were downloaded from the 17th
of February 2015 to the 31st of December 2015. Tweets written in both Italian and English
are analysed, to cover the local and international nature of the public. Figure
1 shows the number of analysed Tweets (here aggregated per month), in Italian and
English, respectively.

Fig. 1 Tweets downloaded from the 17th of February 2015 to the 31st of December 2015 via keywords
concerning Expo Milano 2015. Data are shown aggregated by month, for both Italian and
English. [Figure not reproduced; panels: Italian Tweets count, English Tweets count.]


To fully capture the evolution of sentiment about Expo, we have to deal with
a critical methodological issue: the management of rare categories in the data
set. The breadth of the Expo event involved many different agents on different topics.
The mission of the Expo was educating the public, sharing innovation, promoting
progress and fostering cooperation among participating countries; the event brought
together many different stakeholders, moved by diversified expectations and
perceptions, resulting in a complex and varying arrangement of interests and feelings. This
heterogeneity was reflected in the on-line discourse, which was characterised by some
“mainstream” topics discussed by plenty of people and some “less represented”
categories (hereafter named rare categories) related to issues discussed by fewer
people, but still relevant to understanding the multifaceted reputation of the event.


2.2 The Method 


The presence of rare categories is particularly critical for the implementation of
supervised sentiment classifiers, which nowadays are an essential instrument for
performing sentiment analysis. As discussed in detail in [2], supervised sentiment
classifiers require a training set. By hypothesis, the language used in the training set
is assumed to be representative of the entire text [4], and the training set is labelled through
hand-coding to obtain a better interpretation of the sentiment [1]. When a corpus of texts
is characterised by the presence of rare categories, there is a non-null probability of
not gathering any text belonging to these rare categories in the training set, with the
risk of losing some relevant pieces of information. Against this background, in this
talk we present a new aggregated supervised classification scheme for sentiment and
opinion analysis. This new classifier takes advantage of the integration of the standard
sentiment and opinion analysis techniques proposed by [1] with rare event sampling
techniques [2]. Rare event sampling techniques are strictly linked with the
strategies known as choice-based and case-control sampling [6]. In particular, we focus
on the sampling solution proposed by [7]. The estimation of both broadly discussed
and niche topics is now possible thanks to this new approach, contrary to current
approaches, which are able to deal with the former only. In addition, this
specific feature is particularly relevant from a managerial point of view, because the
identification and analysis of rare categories could be used to anticipate future
trends and to identify and manage potential risks or opportunities. The whole algorithm
is run in R, and it will be implemented as a running option of the next release of the R
package ReadMe.
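
To give a feeling for the size of the problem, the probability that a simple random training set contains no text of a given category can be computed directly. The sketch below assumes independent draws from a large corpus, so that the probability is $(1 - p)^n$ for a category with share $p$ and a training set of size $n$; the shares and the training-set size used in the loop are hypothetical values chosen for illustration, not figures from the study.

```python
def prob_category_missed(share, n_train):
    """P(no text of a category with share `share` among n_train independent draws)."""
    return (1.0 - share) ** n_train


# hypothetical category shares and a hypothetical training set of 500 texts
for share in (0.05, 0.01, 0.001):
    print(share, prob_category_missed(share, 500))
```

Under these assumptions a category that accounts for one text in a thousand is more likely than not to be absent from a random training set of 500 texts, which is the situation that rare event sampling is meant to address.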

In the talk, we will outline the state of the art in opinion mining, with
particular attention to classification methods and, more specifically, to aggregate
supervised ones, which represent the starting point for this work. The proposed
classification scheme will be illustrated in terms of definition of sentiment categories, text
pre-processing, definition of variables, evaluation of the classification scheme, and
computation of results. All the steps of the algorithm will be shown on the analysis of the web
reputation of Expo Milano 2015. The talk finishes with the results of a statistical
comparison between our classification scheme and other existing ones.

Using administrative data for statistical

The use of administrative data for statistical modelling:
an application to contribution evasion


Maria Felice Arezzo and Giuseppina Guagnano 


Abstract Administrative data, gathered by public authorities with a general aim
of control, are very precious sources of information because they make it possible to study
phenomena that would otherwise remain unknown. On the other hand, administrative
data contain strictly the information they were collected for, and to be used for
statistical purposes they need to be integrated. This work shows the potential of
the integration of three data sets for statistical modeling: the audits carried out in
Italy in 2005 by the National Institute of Social Security on building and construction
companies, the ASIA archive of Istat and the “Studi di Settore” of the Italian
Revenue Agency.

Abstract Administrative data, collected by public institutions generally for control
purposes, are extremely valuable sources of information because they
often make it possible to study phenomena that could not otherwise be
known. On the other hand, precisely because they serve specific purposes,
administrative surveys do not contain information beyond that
for which they were designed. The work illustrates the potential of integrating
three databases from different sources: the INPS inspections, the ASIA archive of Istat and the
Studi di settore of the Italian Revenue Agency. The experiment was conducted on
firms operating in the construction sector.

1 Introduction


Administrative data are archives of great interest, as they often contain information
available only to the public authorities responsible for the control of some phenomena.
Almost always, though, these files contain no information other than that for
which they were collected (a typical example of what is missing is the socio-economic
characteristics of the individuals), as the purpose underlying their gathering is not statistical
modeling. For this very reason, administrative data require, on the one hand, a
thorough pretreatment and validation process and, on the other, the development
of statistical methodologies that allow valid inferences to be drawn.

The purpose of our work is to trace the entire “production chain”: a) the creation
of a dataset with all the relevant variables, b) the evaluation of the quality of the dataset, c) the
development of a statistical method suitable for the data at hand.

The case study concerns the detection of firms which evade worker
contributions because they employ off-the-book workers (i.e. employees who are completely
unknown to the fiscal authorities).


2 Creation of the data set 


Our starting point is an administrative dataset on the audits carried out in Italy in
2005 by the National Institute of Social Security (INPS henceforth) on building and
construction companies (NACE section F). It amounts to a total of 31,658
inspections on 28,731 firms. The total number of firms operating in the building
industry in Italy in the same year was N = 595,226. Audit data make it possible to observe
compliant/non-compliant behavior.

Following the idea that the risk of non-compliant behavior can be predicted
from the economic characteristics of the firm, we integrated the audit information
with two other sources of data. The first is the ASIA archive owned by the National
Institute of Statistics (ISTAT). It contains data on legal structure, turnover and
number of employees, and it is a high-quality source, as the information is
validated through a very careful process. The second, owned by the Italian Revenue
Agency, is the so-called ‘Studi di Settore’ (SS in the following) archive. It contains
an exhaustive list of information on corporate organization, firm structure,
management and governance.

The three data sets were merged using VAT numbers and/or tax codes.
Surprisingly, the match rate was only 51%, meaning that the number of firms in the merged
archive is 14,651.
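
The merging step can be pictured as an inner join on a firm identifier. The sketch below is a toy illustration only: it assumes a single shared key per firm (the actual procedure used VAT numbers and/or tax codes), and the identifiers, field names and values are invented for the example.

```python
def merge_on_key(audits, asia, studi_settore):
    """Inner-join three {key: record} mappings; report the match rate w.r.t. the audits."""
    matched_keys = audits.keys() & asia.keys() & studi_settore.keys()
    merged = {
        key: {**audits[key], **asia[key], **studi_settore[key]}
        for key in matched_keys
    }
    match_rate = len(merged) / len(audits)
    return merged, match_rate


# toy records keyed by hypothetical VAT numbers
audits = {"IT001": {"y": 1}, "IT002": {"y": 0}, "IT003": {"y": 0}, "IT004": {"y": 1}}
asia = {"IT001": {"employees": 4}, "IT002": {"employees": 12}, "IT004": {"employees": 7}}
studi = {"IT001": {"turnover": 310}, "IT004": {"turnover": 95}, "IT009": {"turnover": 40}}

merged, rate = merge_on_key(audits, asia, studi)
print(sorted(merged), rate)
```

With the figures reported in the text, 14,651 matched firms out of 28,731 inspected firms correspond to a match rate of about 0.51.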

The original variables were used to build economic indicators, which can be
grouped according to the following facets of the firm: a) 9 indicators for economic
dimension, b) 13 for organization, c) 6 for structure, d) 6 for management, e) 11 for
performance, f) 38 for labor productivity and profitability, g) 3 for contract award
mode, h) 7 variables for location and type. The final dataset had 93 independent
variables observed on 14,651 building companies, with a match rate of 51%. The
variable to be predicted is named Y, and it takes value 1 if the firm has at least
one off-the-book worker and 0 otherwise. In the following we will refer to the
final dataset as the integrated db, because it gathers and integrates information from
different sources.

Table 1 Datasets characteristics

Data owner       Content                            Individual   Dimension
INPS             Inspections outputs (2005)         Inspection   31,658 inspections on 28,731 firms
Revenue Agency   Studi di settore (2005). Models:   Firm         Universe of firms with at most
                 TG69U, TG75U (SG75U), TG50U                     5 million euros of income
                 (SG50U and SG71U), TG70U
ISTAT            Asia Archives (2005)               Firm         Universe of firms


3 The assessment of the integrated dataset 


As we said, the match rate was 51%, which means that we had information on
the features of interest for (roughly) half of the firms in the original INPS database. We
studied inspection coverage and the risk of non-compliance for different turnover
classes and corporate designation types, and over the territory. The idea was to
verify whether a whole group of firms (for example, all the companies in a geographical
region) had been lost because of the merging process.

We checked for: Regions (20 levels), Number of employees (9 classes), Legal
structure (5 levels), Turnover (11 classes); in this way we made sure that no whole
group of firms was lost during the matching procedure.


4 The Model 


From a statistical point of view, there are two main methodological issues arising
from the type of data we use. The first is the non-randomness of the inspections, and
the second is that the fraction of inspected firms in the population is low.


SELECTION BIAS IN THE SAMPLE OF INSPECTED FIRMS. To detect undeclared
work, an inspector audits firms. Inspected firms are not randomly chosen; they are
chosen because the inspector thinks that there are some off-the-book workers, and
s/he has strong incentives to target the “right” firms (i.e. the irregular ones). We can
think of the decision to inspect a firm as a rational process in which the inspection
is made if the utility of inspecting, $U_1^A$ (i.e. finding undeclared workers and getting a benefit),
is higher than the utility of not inspecting, $U_0^A$. Moreover, we can observe the status of
the $i$-th firm (regular or not) only if it has been inspected; otherwise a censoring
process intervenes. It is obvious that there is a strong selection bias in the sample of
inspected firms.

As is well known, [3] proposed a useful framework for handling estimation
when the sample is subject to a selection mechanism. In the original framework, the
outcome variable is continuous and can be explained by a linear regression model
(called the output equation), with a normal random component; in addition to the output
equation, a selection equation describes the selection rule by means of a binary
choice model (probit).

In our framework the output equation defines the compliance decision, so the
dependent variable is binary, and the selection equation refers to the decision to
inspect a firm. Just like the inspection decision, evasion is based on a rational
process, and it happens if the utility of evading, $U_1^C$, is greater than the utility of
complying, $U_0^C$. The corresponding econometric model, in its general form, is:

$$Y_i^* = U_1^C - U_0^C = X_{1i}\beta + \varepsilon_{1i} \qquad (1a)$$
$$A_i^* = U_1^A - U_0^A = X_{2i}\theta + \varepsilon_{2i} \qquad (1b)$$


where $X_i = (X_{1i}, X_{2i})$ is a vector of exogenous variables (namely, $X_{1i}$ for $Y_i$
and $X_{2i}$ for $A_i$), containing all the relevant covariates.

Since we cannot directly observe the utilities (neither those determining
compliance, nor those governing the decision to inspect), we assume that if in equation (1a)
$Y_i^* > 0$, the firm does not comply, otherwise it does. Let us define a dummy variable
$Y_i$, which we can observe and which denotes the alternative selected:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Similarly, we can define an observable dummy variable $A_i$ for the inspections,
such that:

$$A_i = \begin{cases} 1 & \text{if } A_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$


The distribution of $Y_i$ and of $A_i$ is Bernoulli, with probability of success equal
to $\pi^Y$ and $\pi^A$ respectively, depending on $X_{1i}\beta$ and on $X_{2i}\theta$. A selection bias exists if
$\mathrm{corr}(\varepsilon_1, \varepsilon_2) = \rho$ is not null.
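
The effect of a non-null $\rho$ can be seen in a small simulation of equations (1a) and (1b). The sketch below is an illustration under assumptions made here, not the authors' estimator: the errors are bivariate standard normal with correlation $\rho$, each equation has a single standard normal covariate, and the coefficients and the negative intercept that makes inspections rare are hypothetical values.

```python
import numpy as np


def simulate_selection(n=200_000, beta=1.0, theta=1.0, rho=0.7,
                       inspect_intercept=-1.5, seed=0):
    """Simulate (1a)-(1b) with correlated normal errors; return Y and A as 0/1 arrays."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)                      # covariate of the compliance equation
    x2 = rng.normal(size=n)                      # covariate of the inspection equation
    cov = [[1.0, rho], [rho, 1.0]]
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y_star = x1 * beta + eps[:, 0]                        # (1a)
    a_star = inspect_intercept + x2 * theta + eps[:, 1]   # (1b), hypothetical intercept
    return (y_star > 0).astype(int), (a_star > 0).astype(int)


y, a = simulate_selection()
rate_all = y.mean()                  # share of non-compliant firms in the population
rate_inspected = y[a == 1].mean()    # share observed among inspected firms only
print(rate_all, rate_inspected)
```

With a positive $\rho$ the share of non-compliant firms among the inspected ones overstates the population share, which is why an analysis restricted to inspected firms would be biased.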

As is known (see for example [2]), the likelihood function for Heckman’s
selection model is:

$$L(\eta) = \prod_{i=1}^{n} \left[ 1 - \pi^A(X_{2i}) \right]^{1 - A_i} \left[ f(Y_i \mid A_i = 1)\, \pi^A(X_{2i}) \right]^{A_i} \qquad (4)$$

where $\eta = (\beta, \theta, \rho)$ is the vector of parameters to be estimated.


THE CASE-CONTROL SETTING. In this sampling design [4], also known as
response-based, samples of fixed size are randomly chosen from the two strata
identified by the dependent variable $A$. In particular, $n_A$ units are drawn at random from
the $N_A$ cases and $n_{\bar{A}}$ from the $N_{\bar{A}}$ controls.

The likelihood function is the product of the two stratum-specific likelihoods
and depends on the probability that the individual is in the sample, and on the joint
density of the covariates:

$$\prod_{i=1}^{n_A} \Pr(X_i \mid A_i = 1, S_i = 1) \cdot \prod_{i=1}^{n_{\bar{A}}} \Pr(X_i \mid A_i = 0, S_i = 1). \qquad (5)$$



The case-control (c-c) design is particularly suited to our study because the probability that a
firm is inspected is very low, and therefore it is much more convenient to
sample directly from the two strata (inspected/non-inspected).
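
In practical terms, the design draws a fixed number of firms from each stratum, so the probability of being sampled depends only on the inspection status. The following sketch uses the population figures given earlier in the paper (595,226 firms, 28,731 of them inspected); the function names and the sample sizes in the usage line are hypothetical.

```python
import random

N_TOTAL = 595_226                  # building firms in Italy in 2005 (from the text)
N_INSPECTED = 28_731               # inspected firms, A = 1 (from the text)
N_NOT_INSPECTED = N_TOTAL - N_INSPECTED


def case_control_sample(cases, controls, n_cases, n_controls, seed=0):
    """Draw fixed-size simple random samples, without replacement, from each stratum."""
    rng = random.Random(seed)
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)


def sampling_fractions(n_cases, n_controls):
    """P(S = 1 | A = a) implied by the design: it depends on A only."""
    return {1: n_cases / N_INSPECTED, 0: n_controls / N_NOT_INSPECTED}


# hypothetical sample sizes: 1,000 inspected and 1,000 non-inspected firms
print(N_INSPECTED / N_TOTAL, sampling_fractions(1_000, 1_000))
```

Fewer than 5% of the firms are inspected, so a simple random sample of the whole population would contain very few cases; sampling by stratum fixes their number in advance.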


A BINARY CHOICE MODEL WITH SAMPLE SELECTION AND CASE-CONTROL 
SAMPLING SCHEME. In the following we provide the likelihood function under the 
framework of interest, i.e. a sample selection mechanism with a severe censoring 
process. The interested reader can find the full proof and the simulation results in 
[1]. 


We make the following very general and non-restrictive assumptions:


1. we have a set of fully informative and exogenous covariates $X_i = (X_{1i}, X_{2i})$;

2. conditional on the covariates, the probability that an observation is uncensored
does not depend on its value, i.e. $P(A_i = 1 \mid S_i = 1, X_i, Y_i) = P(A_i = 1 \mid S_i = 1, X_i)$;

3. the set of covariates $X_{1i}$, specific for $Y_i$, and the set $X_{2i}$, specific for $A_i$, may have
common elements but cannot fully overlap;

4. the probability of being in the sample depends neither on the covariates
$X_i$ nor on $Y_i$. More precisely, letting $S_i$ be a binary variable which takes value
1 if the $i$-th individual is in the sample and 0 otherwise, it holds that
$P(S_i = 1 \mid X_i, Y_i, A_i = a_i) = P(S_i = 1 \mid A_i = a_i)$.


Assumption (1) means that there is no correlation between the covariates
and the error terms in equations (1a) and (1b). Assumption (2) is justified
because, as the covariates are informative, all the information brought by $Y_i$ is
contained in $X_i$. Assumption (3) is necessary for parameter identification (exclusion
conditions). Assumption (4) is typical of the response-based sampling framework
and requires no further explanation.

Under the conditions stated, the likelihood function for a binary choice model
with sample selection under response-based sampling is:

where $\pi^A(X_{2i})$ is the probability that an observation is uncensored and $\pi^Y(X_{1i})$
is the probability of observing $Y = 1$ given that the observation is uncensored; as
already said, $n_A$ is the number of units sampled from the $N_A$ uncensored observations
and $n_{\bar{A}}$ is the number of units sampled from the $N_{\bar{A}}$ censored observations; $n_y$ is the
number of units in the sample having $Y = y$, with $y = 0, 1$.

It is easy to see that the likelihood (6) is a weighted version of (4), where
the weights simply take into account the sampling design. Note also that in the
maximization process the term $f(X_i \mid S_i = 1)$ is not influential, as it does not contain
any information on the vector of parameters $\eta$, and that in our estimator the only
quantities to be known at the population level are $N_A$ and $N_{\bar{A}}$.

Are Numbers too Large for Kids?
Possible Answers in Probable Stories
Are numbers too big for a child?
Probability stories have the answer


Monica Bailot, Rina Camporese, Silvia Da Valle, Sara Letardi, Susi Osti¹


Abstract Regardless of their calculation skills, children need to approach statistics and
stochastic literacy as soon as possible, in order to build up their ability to deal with
uncertainty when making judgements and decisions. Moreover, statistics and
probability are mandatory in school curricula from primary school onwards. Maths can be
narrated, stories are engaging and playful hands-on activities help kids to learn. So
then, why not convey statistics and probability by means of fables? Animated fables,
where kids play roles and immerse themselves in stories that evolve through events
and decisions based on numbers and statistics. The paper presents two fables on
probability, large numbers and repeated observations. They are part of a larger set of
StatFables that are being written and tested in schools and libraries.

Abstract Regardless of their calculation skills, it is important for children
to come into contact with the rudiments of statistics and probability, in order to learn to handle
uncertainty when expressing judgements and making decisions. Moreover, statistics and probability
are compulsory in school curricula from primary school onwards.
Maths can be narrated, stories capture attention and hands-on activities help
learning. So why not teach statistics and probability through fables? Animated
fables, in which children are the protagonists, immersed in stories that evolve through
events and decisions based on numbers and statistics. The paper presents two fables on
probability, large numbers and repeated observations. They are part of a set of
StatFables being written and tested in schools and libraries.

Key words: Kids, uncertainty, large numbers, probability, Bayesian reasoning


1. StatStory One: 1, 10, 100, 1000 Nights of Silver Moon 


¹ Istat, Italian Institute of Statistics; corresponding author: rina.camporese@istat.it




Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 


In the village of WhoKnowsWhere, every night a witch throws the moon, symbolically
represented by a coin, into the sky. And every night the merchant Hamlet is impatient
to see the face shown by the coin: the silver face of the moon will light up his journey
to the city where he will sell his merchandise, while the black side will force him to wait
through a dark night. Every night a doubt, every night two possible outcomes. Imagine one,
ten, a hundred, a thousand nights... how many of them will the merchant spend on
the road? And how many will he have to wait for a better chance? This is the mystery
the merchant must solve to free his daughter Ada from the cave where she is held
captive by the witch.

Ada has a passion and a flair for maths, and her father keeps bringing her maths
books to relieve the boredom of her solitary life. Another game she plays during her long and
boring days in the cave is to pile up white and black pebbles, counting propitious and
inauspicious nights. After many days of observation, Ada realizes that the two piles of
stones are getting closer and closer in size and in number of pebbles. This discovery,
together with her memory of the Law of Large Numbers she had read about somewhere (a law
which she had initially thought was a funny idea that had popped into the mind of a sloppy
mathematician with weak practical sense), leads her to solve the mystery and break
the spell.
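
Ada's pebble game is easy to replay on a computer. The sketch below assumes a fair coin (the story does not say so explicitly) and uses a hypothetical function name and seed; it prints the share of silver-moon nights after 1, 10, 100 and 1000 throws.

```python
import random


def silver_share(n_nights, seed=2017):
    """Share of silver-moon nights among n_nights throws of a fair coin."""
    rng = random.Random(seed)
    silver = sum(rng.random() < 0.5 for _ in range(n_nights))
    return silver / n_nights


for n in (1, 10, 100, 1000):
    print(n, silver_share(n))
```

What the Law of Large Numbers guarantees is that the share of silver nights settles near one half as the nights add up; the difference in the number of pebbles between the two piles need not shrink, it only becomes small relative to the size of the piles.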


2. Why Stories? Why Kids? 


Why stories to help children (only children?) deal with numbers, formulas and 
probability? Because stories have always been a privileged way to transmit culture. 

Linguists say that a mathematical formula is an extreme form of text (Sabatini,
2016). A formula, a theorem... are stories in a nutshell, a mathematical nutshell.
If linguists can see an extreme form of text in a formula, we statisticians can
see an entire story in it; the story contained in a formula can therefore be unveiled in
a full verbal narration. Maths can be narrated and stories are engaging. So then, why
not convey statistics and probability by means of tales?

If statistical information is communicated in mathematical formats, people have
trouble reading and interpreting it correctly. When it comes to the ability to deal
with data, uncertainty and mathematical representation of phenomena, the words
innumeracy and statistical illiteracy are the most appropriate for the majority of the
population (Till, 2014). Needless to say, this is a major cultural issue in our society.

Regardless of calculus ability, children need to approach statistics and stochastic 
literacy as soon as possible, in order to build up their ability to deal with uncertainty 
when making judgements and decisions and to understand numerical information. 
That’s why statistics and probability are mandatory in early school curricula. 

Children aged eight to ten do have probabilistic intuitions and can develop
secondary intuitions. They can also reason about proportions before being able to
deal formally with fractions (Fischbein, 1970). As Christoph Till wrote in
2014: “[...] risk and decision making under uncertainty can be a prevailing, exciting
and meaningful topic at the end of primary school with sustainable effects. [...] it is
possible to foster elementary competencies for risk assessment and probabilistic
decision making in fourth class. [...] in a playful learning environment, as advised by
results of cognitive psychologists. With these representations children can think
probabilistically without the need of fractions or percentages. [...] As Gigerenzer
(2011; 2013) has repeatedly pointed out elementary probability concepts should be
taught in an informal and heuristic manner at an early stage.”


2.1. Words in action: animated stories 


Kids learn by playing and immersing themselves in stories. They also find hands-on
activities extremely engaging, and this is true for Maths, Stats and Probability too
(Martignon, 2009). Listening to stories can be fun and relaxing, but being the
protagonists of an animated story can be even more exciting. That’s why fables can 
be animated and kids can play roles and immerse themselves into stories that evolve 
through events and decisions based on numbers and statistics. 

The story on the Law of Large Numbers is suitable for children aged four to ten;
the next story on Bayesian reasoning is intended for children aged six to twelve. The
activities associated with them can be adjusted depending on children’s familiarity with
numbers, fractions and spreadsheets. Kids are engaged in the story by telling
small episodes, performing actions and posing questions.

Since questions are the real engine in the learning process and active participation 
enhances the engagement (Chavannes, 2016), children are guided to find out answers 
by carrying out experiments and reasoning on the results. 


3. StatStory Two: The Witches of Bayes


“The Witches of Bayes” is an animated fable to be played by children from 9 to 12
years of age. The aim of the game is to become familiar with Bayesian reasoning
and with the use of all the available pieces of information in order to make decisions.

A group of witches haunts the village of Bayes. Every day one of them, randomly
chosen and unknown to the villagers, asks for a dish of food by placing her hat outside
the cave. The witches have different tastes: some love sweet food and some salty food.
If the dish offered suits the taste of the witch, the day passes peacefully,
otherwise there will be trouble for everyone. Every morning Coco Head, the head of
the village, relies on fate and throws a ritual coin with one face indicating salty
and the other one sweet; based on the visible face he gives orders to the kitchen. The
gloomy days, however, are numerous.

Nora, a young girl, notices that the witches have different hats: some are black and
others are purple. It seems to her that purple hats appear more often and that
the witches tend to prefer sweet dishes, but she is not certain. She also wonders whether
there is a link between the colour of the hat and the taste of the witch. Then she
discovers a parchment full of information, thanks to which she develops an alternative
strategy to choose the dish. She tries to convince the village head to change the
traditional method, but he is unshakable. Nora then asks for permission to serve the
meal herself and, when necessary, secretly prepares dishes other than
those indicated by the ritual coin. After a while the villagers realize that the
devastating raids of the witches are less frequent. Coco Head organizes a ceremony
in honour of the god Bias to thank him for his increased magnanimity.

Note: “The Witches of Bayes” was presented as a poster at the SISBAYES Meeting in Rome, 7-8 February 2017.

Nora, the protagonist, finds a way to decide despite the uncertainty, using
everything she knows. Coco Head, on the contrary, does not change his actions when
new elements arise; worse than that, he finds a way to reinforce his beliefs thanks to
elements that should rather call them into question. He suffers from the status quo bias,
i.e. his perception of the risk of change is amplified by the unjustified belief
that a different choice can only make things worse. The ending is deliberately
controversial, to spark discussion.

The tale is neither read nor told: it is animated by the children with the help of
hats, giant coins and bags through which the selection criteria of the dish are applied.
The choices of Coco Head and Nora are simulated for the thirty-one days of a month
and children evaluate which method leads to more favorable results. After that,
Nora’s diary is studied, for she has been noting down good and bad days for ten years.
Behaviours, methods and results are discussed together with the children. If they are
familiar with fractions, they can also apply Bayes’ theorem, looking into the story and
the underlying data for the necessary information. If the kids are familiar with
spreadsheets, Nora’s diary can be “calculated” by simulating the sequence of results in
a decade of choices, with one method or the other.
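
The "calculated diary" can also be sketched in a few lines of code instead of a spreadsheet. The probabilities below are hypothetical placeholders and are not the values written on Nora's parchment: they only assume that purple hats are more frequent and that hat colour is informative about taste. Coco Head follows the ritual coin, while Nora serves the dish that is most likely to please a witch wearing the hat she sees.

```python
import random

# Hypothetical values, chosen only for illustration.
P_PURPLE = 0.7                            # share of days with a purple hat
P_SWEET = {"purple": 0.8, "black": 0.3}   # P(witch likes sweet | hat colour)

def simulate(days, seed=1):
    """Count the peaceful days obtained by Coco Head's coin and by Nora's rule."""
    rng = random.Random(seed)
    coco_good = nora_good = 0
    for _ in range(days):
        hat = "purple" if rng.random() < P_PURPLE else "black"
        taste = "sweet" if rng.random() < P_SWEET[hat] else "salty"
        coco_dish = "sweet" if rng.random() < 0.5 else "salty"    # ritual coin
        nora_dish = "sweet" if P_SWEET[hat] >= 0.5 else "salty"   # likeliest taste given the hat
        coco_good += coco_dish == taste
        nora_good += nora_dish == taste
    return coco_good, nora_good

def p_purple_given_sweet():
    """Bayes' theorem: P(purple | sweet) = P(sweet | purple) P(purple) / P(sweet)."""
    p_sweet = P_PURPLE * P_SWEET["purple"] + (1 - P_PURPLE) * P_SWEET["black"]
    return P_PURPLE * P_SWEET["purple"] / p_sweet

coco, nora = simulate(3650)   # roughly ten years of daily choices
print(coco, nora, round(p_purple_given_sweet(), 3))
```

With these made-up numbers the coin is right about half of the time, whereas Nora's rule is right with probability 0.7 × 0.8 + 0.3 × 0.7 = 0.77.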


4. What's around the stories 


The authors, at Istat, are working on a set of activities devoted to promoting statistical literacy
through unconventional instruments and non-specialized languages, meant to be used
in schools, libraries and events such as the Festival of Statistics and the European
Researchers’ Night.

These two stories belong to a set about statistics, basic descriptive measurements
and probability, and are being tested in schools and libraries. They aim at conveying
statistics and probability through words, by drawing a path to numbers and
formulas that passes through verbal narration of concepts and active experimentation
thanks to playful activities. An apparent paradox of such activities is that, sometimes,
they deal with mathematical concepts without showing any formula, sometimes not
even a number, and that is done on purpose.


A polarity-based strategy for ranking
social media reviews


Simona Balbi, Michelangelo Misuraca and Germana Scepi 


Abstract Opinion Mining methods are widely used to analyse and classify the
choices, preferences and behaviours of consumers through the opinions gathered on 
the Web. On social media like TripAdvisor such opinions are usually expressed with 
a score and a short text. This paper proposes a strategy for ranking reviews using a 
scale based jointly on the rating and on the text of the reviews. 


Key words: Textual Data, Opinion Mining, Ranking 


1 Introduction 


With the rapid expansion of social media, the practice of sharing opinions on the
Web is more and more widespread. The ways of expressing those opinions are
many: numbers, texts, emoticons, images, videos, audio. These communication tools
are often used jointly. It is becoming a habit for users to evaluate the
products/services they buy/use, by describing their personal feelings and judgments.
We can find websites specialised in one or more topics, where people
can give their opinion using an evaluation scale (e.g., from “terrible” = 1 to
“excellent” = 5), visualised by bullets or stars, and combined with a written description.
As this practice is nowadays considered the core of many marketing strategies,
there is great interest in how to extract knowledge from this kind of information.

Opinion mining procedures have been developed with the main goal of understanding
the mood in a text, transforming it into a numerical value. The basic idea is
to identify positive, negative, or neutral viewpoints. Researchers involved in defining
proper methods for mining opinions on the Web are mainly computer scientists
and computational linguists. They often claim to use statistical techniques.

The main point we are interested in here is that a statistical perspective is
often lacking. Statisticians are professionally involved in the problem of
quantifying something that is not quantitative in itself. Furthermore, the implications
of the choice of a scale, of a weighting system, or of
the proper method for analysing such unconventional data pertain to statisticians.

Here we focus our attention on the so-called rating-inference problem [6], and its
implications when we refer to “reviews and ratings” social media like TripAdvisor.
In this kind of media we usually find ratings in a 1-to-5 star system, together with
written judgments. The challenge is stimulating for a statistician: on the one hand, we
have a judgment on a 5-point scale; on the other hand, we have a (usually) short text.
We propose a two-step strategy for taking the two assessments into account jointly
and defining a single rating.

The paper is organised as follows. Section 2 defines the theoretical framework. 
Section 3 considers the case study. The proposed strategy is presented in Section 4, 
while the main results of applying the strategy on TripAdvisor reviews are discussed 
in Section 5. 


2 Theoretical framework 


Sentiment analysis (SA), also known as opinion mining (OM), refers to the analysis
of people’s opinions, attitudes, or emotions, in a written text. Note that SA is generally
used in industry, while both SA and OM are used in academia. In the following,
we use the two terms interchangeably. Opinions are usually published on specialised
websites, devoted to specific topics like cinema, e-commerce, and so on.

The main goal of SA is to classify documents on the basis of their “polarity”.
The term polarity is used in linguistics for distinguishing affirmative and negative
forms. For a wide review of the different methods of SA refer to [1] [7]. In the literature,
determining the polarity involves three different steps:


1. the subjectivity/objectivity of a text (SO-polarity): decide if a text has a factual
nature or expresses an opinion on its subject matter.

2. the positivity/negativity of a text (PN-polarity): decide if a subjective text
expresses a positive or negative opinion.

3. the positivity/negativity strength of a text (PN-strength): identify different grades
of positive or negative sentiments in opinions.


These steps are sequentially ordered, but it is not mandatory to perform all three.

Focusing on the unit of analysis, we can consider different levels: a document
level, a sentence level, an aspect level. The first two levels are usually considered
in the so-called polarity-based SA, while the latter is used in a topic-based
perspective. The document level aims at defining the polarity of each document,
i.e. whether it expresses a positive or a negative sentiment. At the sentence level each
document is segmented into sentences, and we want to determine the polarity of
each sentence. The PN-polarity is quantified by considering a score of -1, 0 and
1 for negative, neutral and positive sentiment, respectively [2]. Some authors have
proposed different scoring systems by defining the polarity not only in terms of sign
but also taking into account the PN-strength of the sentiment [5]. The aspect-level
SA aims at quantifying specific aspects and yields fine-grained results,
at the cost of a greater computational complexity.

In this paper, we aim at determining the PN-polarity of a document, by consid- 
ering a sentence-level approach. This is the first step of a mixed strategy that uses 
both textual and numerical information. 


3 The Uffizi Gallery on TripAdvisor 


In the last decades several private and public institutions operating in the field of
cultural heritage, like museums, have looked at their visitors from a visitor satisfaction
perspective. The so-called museum audience has become strategically central,
because it has a major connection to museums’ sustainability. In this framework,
it is more and more important to collect and analyse data coming from different
sources. Together with classical sample surveys, carried out on a limited number of
visitors, it is possible to use secondary data available on the Web. This huge amount
of online data can be seen in a big data framework, as they have different natures and are
available in real time. In this paper, we study the audience of the Uffizi Gallery by
analysing a set of reviews published on TripAdvisor.

TripAdvisor is a social media platform specialised in tourism reviews about both
businesses and attractions. According to the most general classification of social media,
it can be defined as a “reviews and ratings” media. It was founded in the U.S. in
February 2000. Since mid-2010 it has been both an online service on the Web and a mobile
application on portable devices. It was one of the first websites to implement
user generated content.

We used a scraping approach, launching a custom crawler in February
2017. In this way we retrieved 9639 reviews written in English and posted on TripAdvisor
from February 27 2003 up to February 10 2017. The crawler also
provided some meta-information about the author of each review (e.g., location,
contribution level on TripAdvisor, number of submitted reviews) and about the review
itself (e.g., date, rating, device used for publishing the review). In the
following we focus our attention only on the reviews and the corresponding ratings.
We decided not to perform any lexical pre-treatment on the reviews. Only the parts
not in English have been deleted, because some reviews also contained sentences in
the mother tongue of the author.


4 A two-step strategy for a polarised rating/ranking 


The rating scale used by TripAdvisor is an ordinal scale. In detail, the ratings from
1 to 5 are associated with the terms terrible, poor, average, very good and excellent,
respectively, and a corresponding number of bullets. Fig. 1 shows the
rating distribution of the Uffizi Gallery updated at April 5th 2017. The ordinal rating
can be seen as a global and comparable measure of the experience, while the textual
description is an evaluation highlighting which aspects are positive and negative.
Therefore, we propose a two-step strategy for the computation of a polarised rating
of a review by combining the rating and the sentiment in the text.


Fig. 1 Visitor rating distribution at April 5th 2017 (Excellent 6282, Very good 2383, Average 932, Poor 239, Terrible 104), shown next to a sample review


Step 1: Computing the reviews’ polarities 

In order to compute the polarity of the reviews, we follow an SA sentence-level 
approach. This level seems to be more suitable, because in these texts each sentence 
includes an opinion of the contributor on the different aspects of the offered service. 

The polarity scores have been calculated by using the R package sentimentr. The
equation used in this package is based on the concept of valence shifters [8]. This
procedure captures the polarity of a sentence by considering the context
in which its terms are used. The polarity of each term is weighted by taking into account
negators (e.g., “never”, “none”), amplifiers and de-amplifiers (e.g., “very”, “few”),
adversative and contrasting conjunctions (e.g., “but”, “however”). This weighting
system makes it possible to emphasise or dampen the positivity and negativity of the terms,
and to obtain a more appropriate measure of the sentence sentiment.

Each review $d_i$ (with $i = 1,\ldots,n$) is segmented into a set $S_{d_i}$ of $q_i$ sentences
$\{s_{i1}, \ldots, s_{ij}, \ldots, s_{iq_i}\}$, by considering as separators only full stops, question marks
and exclamation points. Each sentence $j$ is represented as a sequence of its $p_j$ terms
$\{w_{ij1}, \ldots, w_{ijk}, \ldots, w_{ijp_j}\}$. Each term $w_{ijk}$ in the sentence $s_{ij}$ is compared with a
lexicon of polarised terms, with a score $r_{w_{ijk}}$ of -1 for negative terms and 1 for
positive terms, respectively. The terms not included in the lexicon are assumed to
be neutral, with a score $r_{w_{ijk}}$ equal to 0.

The polarity score of each sentence depends on the dictionary of polarised terms
used in the analysis, while the PN-polarity of the whole document depends on
the polarities of its sentences. Different dictionaries are available: manually created
resources, and automatically or partially automatically
created resources. Many papers in the literature deal with the problem of
choosing a dictionary [4]. We use the Jockers dictionary, a lexicon of more than
10000 terms developed by the Nebraska Literary Lab for the R package syuzhet [3].

The final polarity score $r_{s_{ij}}$ of each sentence is computed as the sum of its
weighted term scores $r^{*}_{w_{ijk}}$ (taking into account the shifters) divided by the square root
of the sentence length:


$$ r_{s_{ij}} = \frac{\sum_{k=1}^{p_j} r^{*}_{w_{ijk}}}{\sqrt{p_j}} \qquad (1) $$
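
As a minimal sketch of the sentence score, the function below sums term scores that are assumed to be already weighted by the valence shifters and divides by the square root of the number of terms. The function name and the example weights are hypothetical; this is not the sentimentr implementation.

```python
import math

def sentence_polarity(weighted_scores):
    """Sum of the (already shifter-weighted) term scores divided by the
    square root of the sentence length, i.e. the number of terms."""
    p = len(weighted_scores)
    if p == 0:
        return 0.0   # assumption: an empty sentence is treated as neutral
    return sum(weighted_scores) / math.sqrt(p)

# Hypothetical 4-term sentence: one amplified positive term, three neutral terms.
score = sentence_polarity([0.0, 1.8, 0.0, 0.0])
print(score)   # 1.8 / sqrt(4)
```

Dividing by the square root rather than by the length itself penalises long sentences less than a plain mean would.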
As we are interested in computing a polarity score for the whole review, we
compute the score $r_{d_i}$ of each document by a down-weighted zeros average of its
sentence polarities. In this averaging function the sentences with neutral sentiment
have minor weight:


$$ r_{d_i} = \frac{\sum_{j=1}^{q_i} r_{s_{ij}}}{\tilde{q}_i + \sqrt{\log(1 + q_i - \tilde{q}_i)}} \qquad (2) $$

where $\tilde{q}_i$ is the number of sentences with a positive or negative polarity. The logic
of down-weighting neutral sentences is that they have less emotional impact in the
review than the polarised ones.
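
The down-weighted zeros average can be sketched as follows. The denominator used here, the number of polarised sentences plus the square root of log(1 + number of neutral sentences), is an assumption based on how sentimentr's `average_downweighted_zero` is usually described; the function name and the example scores are hypothetical.

```python
import math

def downweighted_zero_average(sentence_scores):
    """Average sentence polarities so that neutral (zero) sentences count less
    than polarised ones in the denominator."""
    polarised = sum(1 for s in sentence_scores if s != 0)
    neutral = len(sentence_scores) - polarised
    denom = polarised + math.sqrt(math.log(1 + neutral))
    # Assumption: a review with no sentences at all is treated as neutral.
    return sum(sentence_scores) / denom if denom > 0 else 0.0

# Hypothetical review with two polarised and two neutral sentences.
scores = [0.9, 0.0, -0.3, 0.0]
doc_score = downweighted_zero_average(scores)
plain_mean = sum(scores) / len(scores)
print(round(doc_score, 4), round(plain_mean, 4))
```

For this example the result lies between the plain mean over all four sentences and the mean over the two polarised sentences only, which is the intended effect of down-weighting the neutral ones.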


Step 2: Computing the polarised rating 

The new score for each contributor is obtained by summing the original rating
with the polarity score of the review. Because the polarity scores are unbounded,
we bring all values into the range [0,1]. For each category $c_h$ in the rating
system (with $h = 1,\ldots,H$), the rescaled scores $\tilde{r}_{d_i}$ are computed as:


$$ \tilde{r}_{d_i} = \frac{r_{d_i} - \min_{d_j \in c_h} r_{d_j}}{\max_{d_j \in c_h} r_{d_j} - \min_{d_j \in c_h} r_{d_j}} \qquad (3) $$
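
The second step can be sketched as a min-max rescaling of the polarity scores within each rating category, followed by the sum with the original rating. The helper name and the example scores are hypothetical, and mapping a category with a single distinct polarity value to 0 is an assumption made here to avoid a division by zero.

```python
def polarised_ratings(reviews):
    """reviews: list of (rating, polarity) pairs.
    Returns rating + polarity rescaled to [0, 1] within the rating category."""
    by_rating = {}
    for rating, polarity in reviews:
        by_rating.setdefault(rating, []).append(polarity)
    bounds = {h: (min(v), max(v)) for h, v in by_rating.items()}

    result = []
    for rating, polarity in reviews:
        lo, hi = bounds[rating]
        # Assumption: if all polarities in a category coincide, rescale to 0.
        rescaled = (polarity - lo) / (hi - lo) if hi > lo else 0.0
        result.append(rating + rescaled)
    return result

# Hypothetical polarity scores for five reviews rated 1 or 5.
reviews = [(1, -0.8), (1, -0.2), (1, 0.1), (5, 0.3), (5, 1.1)]
ranked = polarised_ratings(reviews)
print(ranked)
```

Sorting the reviews by the returned value orders them first by rating and then, within the same rating, by sentiment.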

The resulting scoring system has a range [1,H+1], where 1 expresses the strongest
criticism and H+1 expresses the strongest appreciation. The polarised rating can be
interpreted as a ranking, because the new score allows the reviews to be sorted.
Users can browse and read the reviews not only by rating, but also with respect to
the sentiment.


5 Main results


After segmenting the 9639 reviews, we have obtained 48684 sentences. In Tab. 1 it 
is possible to see the statistics about the sentences with respect to the PN-polarity. 


Table 1 Statistics on sentences by PN-polarity

                      NEG      NEU      POS      ALL
sentence             7653    10072    30959    48684
token (N)          131307   113384   517841   762482
type (V)             6975     5597     9719    15524
hapax (V1)           3318     2827     4543     7228
type/token ratio    5.31%    4.94%    1.88%    2.04%
hapax/type ratio   47.57%   50.91%   46.74%   46.56%
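
For readers less familiar with these lexical measures, the sketch below shows how the rows of Table 1 are commonly computed from a tokenised sub-corpus: tokens (N) are all word occurrences, types (V) are distinct words and hapax (V1) are the types occurring exactly once. The whitespace tokenisation and the toy sentence are assumptions; the paper does not describe its tokeniser.

```python
from collections import Counter

def lexical_stats(tokens):
    """Tokens, types, hapax and the two ratios for a non-empty list of tokens."""
    counts = Counter(tokens)
    n = len(tokens)                                    # tokens (N)
    v = len(counts)                                    # types (V)
    v1 = sum(1 for c in counts.values() if c == 1)     # hapax (V1)
    return {"N": n, "V": v, "V1": v1,
            "type/token": v / n, "hapax/type": v1 / v}

# Toy example with a naive whitespace tokeniser.
tokens = "the gallery is great but the queue is long".split()
stats = lexical_stats(tokens)
print(stats)
```

As a check on the table, the type/token ratio of the negative sub-corpus is 6975 / 131307, i.e. about 5.31%.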


We note that the number of positive sentences is much greater than the number 
of the negative ones. This result is consistent with the distribution of the rating 
expressed on TripAdvisor (see Fig. 1). 


Fig. 2 Community detection on co-occurrence network of terms: positive sentences

For visualising the peculiar language associated with positivity and negativity, we
explore the sub-corpora of positive and negative sentences. After constructing the
co-occurrence matrices, the relations among terms are visualised. For identifying a
community of terms we consider the edge betweenness (through IRAMUTEQ¹).

In Fig. 2 the communities related to the positive sentences are highlighted in different
colours. We see that each community represents a topic related to the Uffizi
experience. The main positive aspects are connected with the way the tickets have
been bought, with the possibility of reserving a guided tour, with the different aspects
related to the concept of Art, and with the most important Masters in the gallery.
We note the term “but” in the middle (in terms of betweenness). Its adversative role
gives, as seen above in Sec. 4, a different weight to the sentence polarities.


Fig. 3 Community detection on co-occurrence network of terms: negative sentences


Analogously, in Fig. 3 the communities related to the negative sentences are high- 
lighted. It is interesting to note that although we find some topics in common in the 
two networks, we find different paths. For example, “art” and “gallery” in the net- 
work of negative sentences are related to the inefficiency of the “staff”, while in the 
network of positive sentences (Fig. 2) the same terms describe the visit experience. 


¹ http://www.iramuteq.org/documentation


Fig. 4 Distributions of the TripAdvisor ratings and the polarised ratings


In Fig. 4 we show the distribution of the original ratings and the distribution of
the polarised ratings. The new scale introduces a useful gradation in the judgments.
Below are two examples of reviews rated 1 by the contributor,
and rated 1.0 and 1.9 by the polarised rating, respectively:


Review #2061: I’m not sure why this museum is so famous, the truth is: it’s extremely 
boring, full of statues and religious paintings, all the same, not even the building is nice!! 
The line up is insane, even if you buy tickets in advance, it’s ridiculous, lots of people! 
Worthless!!! Save yourself the trouble, go browse Florence, so much to see outside. Totally 
waste of time and energy, nothing interesting, we were in and out!! Horrible!! 

Review #1121: Buy your tickets online beforehand otherwise you will wait a long time in
a queue. There is a very good rooftop cafe with reasonably priced food and drinks. Some
spectacular photo opportunities through the windows overlooking Florence.


Monitoring the spatial correlation among
functional data streams through Moran’s Index


A. Balzanella, S.A. Gattone, T. Di Battista, E. Romano, R. Verde 


Abstract This paper focuses on measuring the spatial correlation among functional
data streams recorded by sensor networks. In many real world applications, spatially
located sensors are used for performing, at a very high frequency, repeated measurements
of some variable. Due to the spatial correlation, sensed data are more likely to
be similar when measured at nearby locations than in distant places. In order
to monitor such correlation over time and to deal with huge amounts of data, we propose
a strategy based on computing the well known Moran’s index on summaries of
the data.


1 Introduction


Functional Data Analysis (FDA) has become a topic of interest in Statistics due
to the increasing ability to measure and record the results of natural phenomena
over a continuous domain [6]. In environmental sciences, monitoring a physical
phenomenon at different places of a geographic area is becoming very common thanks
to the availability of sensor networks which can perform, at a very high frequency,
repeated measurements of some variable. We can think, for instance, of temperature
monitoring, seismic activity monitoring or pollution monitoring over the locations of
a geographic space. In this context one works with data having complex characteristics,
including spatial dependence structures. Often, the data acquisition is performed
by sensors having limited storage and processing resources. Moreover, the communication
among sensors is constrained by their physical distribution or by limited
bandwidth. Finally, the recorded data often relate to highly evolving phenomena,
for which it is necessary to use algorithms that adapt the knowledge with the arrival
of new observations. The data stream mining framework offers a wide range of
specific tools for dealing with these potentially infinite data arriving online. An
overview of recent contributions is available in [2].

An emerging challenge, in this context, is the monitoring of the spatial dependence
among sensor data. The First Law of Geography, also frequently known as
Tobler’s Law [5], states that “everything is related to everything else, but near things
are more related than distant things”. This law finds its major developments in Geostatistics
but is still valid in the framework of data stream mining, when the data are
collected by spatially located sensors. For instance, surface air temperature streams
are more likely to be similar when measured at nearby locations than in distant
places.

Measuring the spatial dependence among fast and potentially infinite data streams
is a very challenging task. This is due to a set of stringent constraints: i) the available
time for processing the incoming observations is small and constant; ii) the allowed
memory resources are orders of magnitude smaller than the total size of input data;
iii) only one scan of the data is feasible; iv) the communication between the sensors
should be very limited.

This paper introduces a new strategy for monitoring the spatial dependence over 
time which adapts the classic Moran’s index to the challenge of functional data 
stream processing. 

We assume that sensors do not communicate with each other but only with a
central node. Thus, a first part of the processing is performed at the sensors while a
second part is performed at the central computation node using the output of the sensors.
In particular, each data stream recorded by a sensor is processed individually
through two summarization steps. The first one splits the incoming data stream into
non-overlapping windows and provides a compact representation of the observations
in each window. The second step runs, on each data stream, a CluStream ([1])
algorithm adapted for working on functional data subsequences. CluStream groups
the incoming data into homogeneous micro-clusters and represents these through
prototypes.

As the data flow, each sensor performs two kinds of data transmission
to the central computation node. The first one is a snapshot of the micro-cluster
centroids at predefined time stamps. The second one, which is performed at each
window, consists in sending the identifier of the micro-cluster to which the subsequence
has been allocated. In this way, the communication between the sensors
and the central node requires low bandwidth as well as low memory resources.
Only a few micro-cluster prototypes are stored for each data stream at the central
node, and the sensor data are replaced by the micro-cluster centroid to which they
have been allocated by CluStream.

The central processing node is then used for measuring the spatial dependence
among the streams, by computing Moran’s index on the micro-cluster centroids.
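
For reference, the classic Moran's index for one scalar value per location can be sketched as below. This is the textbook scalar version, I = (n / W) · Σ w_ij (x_i − x̄)(x_j − x̄) / Σ (x_i − x̄)², with W the sum of the spatial weights; how the index is adapted to functional centroids is not detailed in this section, so applying it to a single summary value per sensor (for instance a centroid evaluated at one time point) is an assumption, as are the binary neighbour weights of the example.

```python
def morans_i(values, weights):
    """Classic Moran's I for scalar values and a spatial weight matrix
    (list of lists, diagonal ignored). Assumes the values are not all equal."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    w_sum = sum(weights[i][j] for i, j in pairs)
    num = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Four hypothetical sensors on a line; neighbours are adjacent sensors.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)        # similar neighbours
alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain)   # dissimilar neighbours
print(round(smooth, 3), round(alternating, 3))
```

Positive values indicate that nearby sensors record similar values, negative values that neighbours tend to differ.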

The next sections provide the details of the processing setup. 


2 Sensor data summarization through on-line clustering 


Let $Y = \{Y_1(t), \ldots, Y_i(t), \ldots, Y_n(t)\}$ be a set of $n$ functional data streams. $Y_i(t)$, $t \in T$,
denotes a function defined on an interval $T \subset \mathbb{R}$. Each functional data stream
$Y_i(t)$ is made of observations recorded by a sensor located at $s_i \in S$, with $S \subset \mathbb{R}^2$ being
the geographic space.

We assume that the potentially infinite data are recorded on-line, so that we can
keep in memory only subsets of the streams. Thus, the analysis is performed using
the observations in the most recent batch and some synopsis of the old data, which are no
longer available.

In reality, we observe the data at a grid of $N$ points, $t_1, \ldots, t_N$. The functional data
analysis viewpoint may be described by the following non-parametric model:


$$ Y_{ij} = f_i(t_j) + \varepsilon_{ij} \qquad (1) $$


where $f_i(t)$ is the underlying signal curve, $\varepsilon_{ij}$ is an observation noise with mean
zero and null covariance and $Y_{ij}$ denotes the observed noisy data, $i = 1,\ldots,n$ and
$j = 1,\ldots,N$.

We split the incoming data streams into non-overlapping windows identified by
$w = 1,\ldots,\infty$. A window is an ordered subset of $T$ of size $b$ which frames, for
each $Y_i(t)$, a data batch $Y_i^w(t)$, that is, the portion of $Y_i(t)$ observed within window $w$.

A CluStream ([1]) algorithm, suitably adapted for working with the functional
subsequences $Y_i^w(t)$ of the data stream $Y_i(t)$, is used for providing a fast-to-compute
summarization of the stream (more details will be provided in the extended version
of this manuscript).
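
The windowing step can be sketched with a generator that scans the stream once and emits a batch every b observations. The function name is hypothetical, and treating the window size b as a number of consecutive observations is an assumption; an incomplete final batch is simply held back until more data arrive.

```python
def split_into_windows(stream, b):
    """Yield (w, batch) pairs, where batch holds b consecutive observations
    and the window identifier w starts at 1. Works on unbounded iterables."""
    batch = []
    w = 0
    for obs in stream:
        batch.append(obs)
        if len(batch) == b:
            w += 1
            yield w, batch
            batch = []   # start a fresh, non-overlapping window

windows = list(split_into_windows(range(10), b=4))
print(windows)   # [(1, [0, 1, 2, 3]), (2, [4, 5, 6, 7])]
```

Because each observation is read exactly once and only the current batch is kept in memory, the sketch respects the single-scan and limited-memory constraints listed in the Introduction.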

The intuition that underlies the method is to represent the incoming data through
the centers of low-variability (micro-)clusters. In order to have a high representativity
of the input data, the number of clusters to keep updated is not specified a priori;
only a threshold on their maximum number is fixed, to manage the memory
resources.

As mentioned above, the data structure we use for data summarization is named
micro-cluster. For each stream $Y_i(t)$, we keep a set $\mu C_i = \{\mu C_i^1, \ldots, \mu C_i^k, \ldots, \mu C_i^K\}$
of micro-clusters, where $\mu C_i^k$ records the following information:


$\bar{Y}_i^k(t)$: the cluster centroid;
$n_i^k$: number of allocated functions;
$\sigma_i^k(t)$: standard deviation;
$Sw$: sum of window Id;
$SSw$: sum of squared window Id.

Whenever a new window $w$ of data is available, CluStream allocates the subsequence
$Y_i^w(t)$ to an existing micro-cluster or generates a new one. The first preference
is to assign the subsequence to a currently existing micro-cluster. If we choose
the squared $L^2$ distance as our dissimilarity metric between two functions, defined as

$$d^2(Y_i(t), Y_{i'}(t)) = \int_0^T \left(Y_i(t) - Y_{i'}(t)\right)^2 dt \qquad (2)$$

then $Y_i^w(t)$ is allocated to the micro-cluster $\mu C_i^k$ if

$$d^2(Y_i^w(t), \bar{Y}_i^k(t)) < d^2(Y_i^w(t), \bar{Y}_i^{k'}(t)) \qquad (3)$$

and

$$d^2(Y_i^w(t), \bar{Y}_i^k(t)) < u \qquad (4)$$

with $k \neq k'$ and $k = 1, \dots, K$.

The threshold value $u$ controls whether $Y_i^w(t)$ falls within the maximum boundary
of the micro-cluster, which is defined as a factor of the standard deviation of the
subsequences in $\mu C_i^k$. In order to take into account the functional nature of the data,
a pre-smoothing step may be applied before clustering [3].

The allocation of a subsequence to a micro-cluster involves updating its
information. The first update increases $n_i^k$ by 1. Then the micro-cluster centroid
and standard deviation are updated. Finally, the sum and the sum of squares of the
window identifiers are updated to account for the time window $w$.

If $Y_i^w(t)$ is outside the maximum boundary of every micro-cluster because of the
evolution of the data stream, a new micro-cluster is initialized by setting $Y_i^w(t)$ as
its centroid and $n_i^k = 1$. The functional standard deviation $\sigma_i^k(t)$ is defined
heuristically by setting it to the pointwise squared Euclidean distance to the closest cluster.

The proposed procedure, performed in parallel on all the streams, makes it possible
to keep, at each time instant, a snapshot of the data behavior. This is due to the
availability of the set of subsequences used as representatives.
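
The allocation and update rules described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: curves are assumed to be sampled on a common grid with unit spacing, the boundary `u` is passed in as a fixed number instead of being derived from the standard deviation, the incremental (Welford) update of the deviation is one possible choice, and dropping the oldest micro-cluster when `max_clusters` is reached is an assumption. All names (`sq_l2`, `MicroCluster`, `allocate`) are hypothetical.

```python
def sq_l2(a, b):
    """Squared L2 distance between two curves sampled on the same grid
    (unit grid spacing assumed, so the integral reduces to a sum)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class MicroCluster:
    """Summary kept for one micro-cluster of a single stream."""

    def __init__(self, curve, w):
        self.centroid = list(curve)    # pointwise mean of the allocated curves
        self.n = 1                     # number of allocated curves
        self.m2 = [0.0] * len(curve)   # pointwise sum of squared deviations
        self.sw = w                    # sum of window ids
        self.ssw = w * w               # sum of squared window ids

    def std(self):
        """Pointwise (population) standard deviation of the allocated curves."""
        return [(m / self.n) ** 0.5 for m in self.m2]

    def absorb(self, curve, w):
        """Update count, centroid, deviation and window sums (Welford update)."""
        self.n += 1
        for j, y in enumerate(curve):
            delta = y - self.centroid[j]
            self.centroid[j] += delta / self.n
            self.m2[j] += delta * (y - self.centroid[j])
        self.sw += w
        self.ssw += w * w


def allocate(clusters, curve, w, u, max_clusters=10):
    """Assign `curve` (window `w`) to the nearest micro-cluster if it lies
    within the boundary `u`; otherwise open a new micro-cluster."""
    if clusters:
        k = min(range(len(clusters)),
                key=lambda i: sq_l2(curve, clusters[i].centroid))
        if sq_l2(curve, clusters[k].centroid) < u:
            clusters[k].absorb(curve, w)
            return k
    if len(clusters) >= max_clusters:
        # Assumption: make room by dropping the cluster with the oldest
        # average window id.
        oldest = min(range(len(clusters)),
                     key=lambda i: clusters[i].sw / clusters[i].n)
        del clusters[oldest]
    clusters.append(MicroCluster(curve, w))
    return len(clusters) - 1


clusters = []
allocate(clusters, [0.0, 0.0], w=1, u=1.0)   # opens the first micro-cluster
allocate(clusters, [0.2, 0.0], w=2, u=1.0)   # absorbed by the first one
allocate(clusters, [5.0, 5.0], w=3, u=1.0)   # too far: opens a second one
print(len(clusters), clusters[0].n, clusters[0].centroid)
```

Only the identifier returned by `allocate` would need to be sent to the central node, since the centroids themselves are kept in the snapshot.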
3 Moran’s index on data stream summaries 


Moran’s index [4] is a widely used measure for testing the global spatial
autocorrelation in spatial data. It is based on cross-products of the deviations from the
mean and is calculated for the $n$ observations of a variable $X$ at locations $i$, $i'$, as:

$$I = \frac{n}{\sum_i \sum_{i'} a_{ii'}} \, \frac{\sum_i \sum_{i'} a_{ii'} (x_i - \bar{x})(x_{i'} - \bar{x})}{\sum_i (x_i - \bar{x})^2}$$

where the weights $a_{ii'}$ define the relationships between locations in the
geographic area.

Moran’s index is similar, but not equivalent, to a correlation coefficient. It typically
varies from $-1$ to $+1$. In the absence of autocorrelation and regardless of the specified
weight matrix, the expectation of Moran’s I statistic is $-1/(n-1)$, which tends to
zero as the sample size increases.

According to the processing setup introduced above, the central computation
node keeps a snapshot of the micro-cluster centroids of each stream. Every time a
new window becomes available, the spatial autocorrelation can be measured
by receiving at the central node, from each data stream, the identifier of the
micro-cluster to which the subsequence of the window has been allocated. This approach
makes it possible to measure the spatial dependence of the data in a window by using the
micro-cluster centroids rather than the raw sensor data.

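In this sense, Moran’s index can be computed by:

$$I = \frac{n}{\sum_i \sum_{i'} a_{ii'}} \, \frac{\sum_i \sum_{i'} a_{ii'} \int_0^T \left(\bar{Y}_i^k(t) - \bar{Y}(t)\right)\left(\bar{Y}_{i'}^{k'}(t) - \bar{Y}(t)\right) dt}{\sum_i \int_0^T \left(\bar{Y}_i^k(t) - \bar{Y}(t)\right)^2 dt} \qquad (5)$$

where $\bar{Y}_i^k(t)$ and $\bar{Y}_{i'}^{k'}(t)$ are the micro-cluster centroids to which, respectively,
the subsequences $Y_i^w(t)$ and $Y_{i'}^w(t)$ have been allocated, and $\bar{Y}(t)$ is the average
subsequence.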

The proposed Moran’s index can be used to obtain a different measure of the
spatial dependence at every time window $w$, starting from the micro-cluster
identifiers sent by the sensors to the central communication node.


I (6) 
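
As an illustration of how the index could be evaluated at the central node, the following Python sketch computes a Moran-type index from centroid curves. It assumes that the curves are sampled on a common grid and that the products of deviations in the classical formula are replaced by inner products of curves, approximated by sums; the function name `moran_index_curves` and the toy weights are hypothetical.

```python
def moran_index_curves(curves, weights):
    """Moran's I for curves sampled on a common grid.

    curves  : list of n lists, one centroid curve per sensor location
    weights : n x n nested list of spatial weights a[i][j]
    """
    n = len(curves)
    m = len(curves[0])
    # Pointwise average curve across locations.
    mean_curve = [sum(c[j] for c in curves) / n for j in range(m)]
    # Deviation of each curve from the average curve.
    dev = [[c[j] - mean_curve[j] for j in range(m)] for c in curves]

    def inner(a, b):
        # Riemann-sum inner product; the grid spacing cancels in the ratio.
        return sum(x * y for x, y in zip(a, b))

    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * inner(dev[i], dev[j])
              for i in range(n) for j in range(n))
    den = sum(inner(d, d) for d in dev)
    return (n / w_sum) * (num / den)


# Four sensors on a line, each linked to its immediate neighbours.
weights = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
centroids = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]
print(moran_index_curves(centroids, weights))  # about 0.333: neighbours are similar
```

With curves of length one the function reduces to the classical scalar Moran's index, which gives a quick way to check it against known values.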


4 Conclusions and perspectives 


In this paper we have introduced an approach for measuring the spatial
autocorrelation among functional data streams recorded by sensors. Since the main spatial
dependence measures require a high computational effort, we have proposed to
perform a data summarization and to compute the spatial autocorrelation on the
summaries rather than on the original data. Preliminary tests on simulated data confirm
the effectiveness of the proposed summarization strategy in keeping track of the
spatial correlation structure.
User query enrichment for personalized access to 
data through ontologies using matrix completion 
method 


Oumayma Banouar and Said Raghay 


Abstract Current information systems provide transparent access to multiple,
distributed, autonomous and potentially redundant data sources. Their users may not
know the sources they query, nor their description and content. Consequently,
their queries no longer reflect a need that must be satisfied but an intention that must
be refined. The purpose of personalization is to facilitate the expression of users’
needs. It allows them to obtain relevant information by maximizing the exploitation
of their preferences grouped in their respective profiles. In this work, we present a
matrix completion method that minimizes the nuclear norm to construct our users’
profiles. We then present the enrichment process for their queries, expressed in
SPARQL, used to interrogate data sources described by ontologies.




Key words: Personalization, User profile construction, Matrix completion, 
Enrichment, Ontologies. 
1 Introduction 


The multiplicity of data sources, their scalability and the increasing difficulty of
controlling their descriptions and contents are the reasons behind the emerging
need to personalize users’ requests. A major limitation of current information systems is
their inability to classify and discriminate users based on their interests, their
preferences and their query contexts. They cannot deliver relevant results according
to the users’ respective profiles [1]. With personalization, by contrast, the execution of the
same request expressed by different users over an ontology-based mediation system will not
necessarily provide the same results. We address here personalized access to data
sources using ontologies. A user accessing an information system with the intention
of satisfying an information need may have to reformulate the query several
times and sift through many results until a satisfactory answer, if any, is obtained.
This is a very common experience. A critical observation is that different users may
find different things relevant when searching, because of different preferences, goals,
etc. Thus, they may expect different answers to the same query. The personalization
of a query uses the user profile to rephrase the request by integrating elements of the
user’s interests or preferences. Storing user preferences in a user profile gives a retrieval
system the opportunity to return more focused, personalized and hopefully smaller
answers.

The objective of the query personalization process is to enhance the user query
with the related preferences stored in the user’s profile. This process focuses on the system
user and enables the exploitation of what is called personal relevancy [2] instead of
consensus relevancy. With personal relevancy, the information system computes relevancy
based on each individual's characteristics, whereas with consensus relevancy it presumes
that the relevancy computed for the entire population is relevant for each user. This
work presents a matrix completion method based on the minimization of the nuclear
norm of the matrix that represents the preferences of our users over items. It then
enriches the user query, expressed in SPARQL, to be evaluated over ontologies.

The remainder of this paper is organised as follows. Sections 2 and 3 present our
proposed approach, while Section 4 discusses our experimental results. Finally, we
conclude by presenting the next challenges for data management using learning
methods in an information retrieval context.


2 Proposed approach 


In our work, the enrichment and the rewriting processes are dependent. These two
algorithms add predicates to the user query: profile predicates for enrichment and
semantic links for rewriting [1]. The rewriting process identifies the sources that
contribute to the execution of the user query and uses their definitions to reformulate it.
The user expresses the query according to the terms of a global schema that provides
transparent access to multiple data sources. The rewriting process transforms it in order
to evaluate it on the schemas of the different data sources. However, the rewritten user
query is an enriched one. It contains predicates reflecting the preferences of the user over
the research domain. The user profile groups these preferences.


Figure 1: Enrichment-rewriting process 


With: Pu: User profile, Qu: User query, Vs: Data sources descriptions, Qu' : Enriched 
user query, W,* : Enriched user query rewritten. 

The first process, the personalization (enrichment) process, integrates
elements of the user’s interests or preferences into the query. Based on
[2], the user profile is composed of a set of weighted predicates. The weight of a
predicate expresses its relative interest to the user. It is specified by a real number
between 0 and 1. The profile construction phase should rely on a machine learning
method.

Once the user query has been enriched, the second process, query rewriting, takes
place. It depends on the way the system defines mappings. Our system
adopts the Local-As-View approach. It defines the relations of each source by a query
in terms of the virtual or global schema obtained from a global ontology. This mediation
approach facilitates the incorporation of dynamic sources. Indeed, changing a
source means changing a single query.

Each data source integrated in the system has a local schema describing
its structure and content. This schema is obtained through a local ontology. The
adoption of a Local-As-View approach for mediation assumes that the system defines
every data source according to the terms of the global schema provided by a global
ontology. This system frees the user from having to locate the sources that are relevant to
a query, interact with each one in isolation, and combine data from multiple sources.
The users do not ask queries in terms of the schemas in which the data is stored, but
in terms of the mediated (global) schema. The mediated schema is a set of relations
designed for a specific data integration application, and contains the salient aspects of
the domain under consideration. The tuples of its relations are not actually stored in
the data integration system. Instead, it includes a set of source descriptions that
provide semantic mappings between the relations in the sources’ schemas and the
relations in the mediated schema. An ontology-based mediator purveys information
and provides mutual access to knowledge [1].

The construction of users’ profiles is the key personalization enabler and a useful
tactic in data integration tasks dealing with the irrelevancy problem. It takes elements of
the user preferences as input and determines the user’s profile as output. A user profile is a
set of weighted elements that defines the preferences of its owner over items. Machine
learning approaches make it possible to manipulate profiles as automatically as
possible.


Figure 2: Ontology-based mediation architecture 


3 Matrix completion for personalized access to data 


Our proposed query enrichment process relies on the following three main steps:
- A learning process to identify users and preferences clusters.
- A predictive method using the clusters found in step 1.
- A user query enrichment using the preferences predicted in step 2.

We start our learning process with a Singular Value Decomposition of our matrix
R(m,n), which models the users of our system as its rows and the preferences as its
columns, and of its transpose R’. The objective is to reduce the dimensionality of the
matrix. This process applies the K-means algorithm twice: first to obtain the
user clusters, and then to obtain the preference clusters. After the clustering
of the items and users, the prediction process starts, with the aim of completing the ratings
given by a user for a corresponding item. For a given user, respectively an item, we
identify the cluster to which the selected user, respectively the preference, belongs. The
predicted score or rate is the result of the Singular Value Thresholding (SVT) algorithm
[3] applied to the matrix containing the rates that users in the selected user cluster gave
to preferences in the selected preference cluster. It has the following objective function:

Minimize $\|R\|_* + \mu \|R\|_F^2$
Subject to $R_\Omega = b$
where $\|R\|_*$ is the nuclear norm, which sums the amplitudes of the nonvanishing singular
values, $\|R\|_F$ is the Frobenius norm of the matrix $R$, and $R_\Omega$ denotes the entries of $R$
at the observed locations.
The adopted algorithm takes three mandatory parameters:
- $\Omega$, the set of locations corresponding to the observed entries. It can be defined in
three forms: as a sparse matrix where only the nonzero elements are taken into account;
as a linear vector that contains the positions of the observed elements; or as indices
$(i,j)$ with $i, j \in \mathbb{N}$.

- $b$, the linear vector which contains the observed elements.

- $\mu$, the smoothing degree.
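
For reference, the SVT iteration as formulated in [3] can be sketched as follows. The symbols $M$ (partially observed matrix), $\mathcal{P}_\Omega$ (projection onto the observed entries), $\tau$ (threshold) and $\delta_k$ (step size) follow the notation of [3] and are assumptions with respect to this paper, whose objective is scaled differently through the smoothing parameter.

```latex
\begin{aligned}
X^{k} &= \mathcal{D}_\tau(Y^{k-1}), \\
Y^{k} &= Y^{k-1} + \delta_k \, \mathcal{P}_\Omega(M - X^{k}), \qquad Y^{0} = 0, \\
\mathcal{D}_\tau(Y) &= U \,\mathrm{diag}\big(\max(\sigma_i - \tau, 0)\big)\, V^{\top},
\qquad Y = U \,\mathrm{diag}(\sigma_i)\, V^{\top}.
\end{aligned}
```

Each step shrinks the singular values of the current iterate by $\tau$, which favours a low-rank solution, and then corrects the iterate on the observed entries only.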

Applying the SVT algorithm block by block yields, in some cases, results that are
out of range. In these cases, we use an aggregation process to predict the
rates. The predicted rate is equal to the mean of all rates found in the intersection between
the cluster to which the user belongs and the cluster that contains the preference. For each
user, the preferences are ranked by weight in order to select the K predicates with the
highest weights, which are integrated into the initial user query for its evaluation over an
ontology describing a data source.
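
The fallback and the selection of predicates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: unknown ratings are represented by `None`, the valid range is taken to be the 1 to 5 rating scale, and the function names and the example predicates are hypothetical.

```python
def block_mean(ratings, user_ids, item_ids):
    """Mean of the known ratings in the block formed by a user cluster
    and a preference cluster (None marks an unknown rating)."""
    known = [ratings[u][i] for u in user_ids for i in item_ids
             if ratings[u][i] is not None]
    return sum(known) / len(known) if known else None


def repair_prediction(pred, ratings, user_ids, item_ids, lo=1.0, hi=5.0):
    """Keep an in-range prediction, otherwise fall back to the block mean."""
    if lo <= pred <= hi:
        return pred
    return block_mean(ratings, user_ids, item_ids)


def top_k_predicates(profile, k):
    """Return the k predicates of a profile with the highest weights."""
    return sorted(profile, key=profile.get, reverse=True)[:k]


ratings = [[5, None, 3],
           [4, 2, None]]
print(repair_prediction(6.2, ratings, user_ids=[0, 1], item_ids=[0, 2]))

# Hypothetical profile: predicate -> weight in [0, 1].
profile = {"genre = drama": 0.9, "year > 2000": 0.4, "language = en": 0.7}
print(top_k_predicates(profile, 2))
```

The selected predicates would then be appended to the user's SPARQL query as additional filter conditions.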


4 Experimental results 


The approach is evaluated on the MovieLens dataset, which consists of 100 000
ratings, on a scale from 1 to 5, given by 943 users to 1682 films. Each user has rated
at least 20 movies. The data are split 80%/20% into training and test sets.

We performed the first step of our approach to detect 10 clusters, 5 for users and
another 5 for films, according to the rating scale. This step has a complexity of O(nkt),
where n is the number of data objects, t is the number of iterations and k is
the number of classes generated. The second step, which corresponds to the
predictive method, allowed us to recover the initial matrix R, of dimension
943x1682, from only 100 000 known entries, which correspond to about
6.3% of the full data. In order to demonstrate the efficiency of combining
the aggregation method with the SVT algorithm applied per block, we applied
several Low-Rank Matrix Recovery and Completion methods to our
experimental data. These methods minimize the nuclear norm of our users-preferences
matrix with the aim of recovering the missing data with a precise rank. We cite the Augmented
Lagrange Multiplier method (ALM) [4], the Accelerated Proximal Gradient method
(APG) [5], the Dual Method (DM) [6] and the Fixed-Point Continuation method (FPC) [7]. Only
the SVT, FPC and ALM algorithms recovered the matrix with the desired rank 943.

The following table presents a comparison of the results obtained according to
four metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), relative
recovery error (E1) and relative recovery error in the spectral norm (E2).
Table 1: Experimental results 

Method              MAE            RMSE           E1             E2
Proposed approach   1.105209e-02   1.822554e-02   1.451178e-01   3.788417e-02
SVT algorithm       3.045597e-02   3.570902e-02   2.031278e-01   1.207209e-01
FPC algorithm       3.563525e-02   4.086687e-02   2.173032e-01   1.294834e-01
ALM algorithm       7.431944e-02   1.978355e-01   4.781151e-01   2.570103e-01
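
As a reading aid for the table, the first three metrics could be computed as in the following Python sketch. The paper does not give the formulas, so the definitions used here (in particular the relative recovery error as the Frobenius norm of the difference divided by the Frobenius norm of the true matrix) are assumptions, and the function names are hypothetical. The spectral-norm error is omitted because it requires a singular value decomposition.

```python
import math


def flatten(matrix):
    """Flatten a nested list (matrix) into a single list of entries."""
    return [x for row in matrix for x in row]


def mae(truth, pred):
    """Mean absolute error over all entries."""
    t, p = flatten(truth), flatten(pred)
    return sum(abs(a - b) for a, b in zip(t, p)) / len(t)


def rmse(truth, pred):
    """Root mean square error over all entries."""
    t, p = flatten(truth), flatten(pred)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, p)) / len(t))


def relative_frobenius_error(truth, pred):
    """Frobenius norm of the error divided by that of the true matrix."""
    t, p = flatten(truth), flatten(pred)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(t, p)))
    den = math.sqrt(sum(a * a for a in t))
    return num / den


truth = [[1.0, 2.0], [3.0, 4.0]]
pred = [[1.0, 2.0], [3.0, 5.0]]
print(mae(truth, pred), rmse(truth, pred), relative_frobenius_error(truth, pred))
```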


5 Conclusion 


A major limitation of ontology-based information retrieval systems is their
inability to deliver pertinent results according to the users’ preferences. Indeed, they
depend on the users’ queries, which are insufficient for giving a complete picture
of what the users are looking for. In fact, these systems return the same result
regardless of who submitted the query. In addition, the same query does not
necessarily express the same intent. In this work, we presented a profile construction process
that is used to enrich the user query expressed in SPARQL. It is based on three
main steps: a learning process to identify users and preferences clusters; a
predictive method using the clusters found in step 1, based on the SVT algorithm
for nuclear norm minimization and an aggregation function; and a user query enrichment
using the preferences predicted in step 2. The next challenge is to increase the
pertinence and the precision of the information retrieval process by automatically
updating the profile after each user-system interaction. This involves two
operations: the construction of the user profile from a small observed set of data, and the
overloading of the user profile after a modification of a user preference.


References 
[1] O. Banouar and S. Raghay, “User profile construction for personalized access to multiple data sources
through matrix completion method”, IJCSNS, vol. 16, no. 10, pp. 51-57, 2016.
[2] G. Koutrika and Y. Ioannidis, “Personalizing queries based on networks of composite preferences”,
ACM Transactions on Database Systems, vol. 35, pp. 1-50, 2010.
[3] J.-F. Cai, E. J. Candès and Z. Shen, “A singular value thresholding algorithm for matrix completion”,
SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956-1982, 2010.
[4] Z. Lin, M. Chen, L. Wu and Y. Ma, “The Augmented Lagrange Multiplier Method for Exact Recovery
of Corrupted Low-Rank Matrices”, UIUC Technical Report UILU-ENG-09-2215, November 2009.
[5] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen and Y. Ma, “Fast Convex Optimization Algorithms for
Exact Recovery of a Corrupted Low-Rank Matrix”, UIUC Technical Report UILU-ENG-09-2214, August
2009.
[6] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen and Y. Ma, “Fast Convex Optimization Algorithms for
Exact Recovery of a Corrupted Low-Rank Matrix”, UIUC Technical Report UILU-ENG-09-2214, August
2009.
[7] E. T. Hale, W. Yin and Y. Zhang, “Fixed-point continuation for l1-minimization: Methodology
and convergence”, SIAM Journal on Optimization, vol. 19, no. 3, pp. 1107-1130, 2008.


The Trieste Observatory of cardiovascular 
disease: an experience of administrative and 
clinical data integration at a regional level 
Abstract The Trieste Observatory of Cardiovascular Diseases was established
in 2009 with the aim of integrating administrative and clinical data sources in order to
conduct epidemiological studies based on a real-world population. Our interests are
focused on two main areas. From an epidemiological point of view, the aim is
to minimize the various sources of bias in the study design that commonly
arise in observational settings and that are a crucial point when using
administrative data sources. From a statistical point of view, the specific domain of
interest is methodological research comparing different approaches to the analysis of
longitudinal measures and their impact on time-to-event data, considering recurrent
and competing events.

Giulia Barbati and Francesca Ieva


Key words: Administrative Health Data, Epidemiology, Cardiovascular Diseases, 
Population Attributable Fraction, Multi-State Models 


1 Introduction 


The “Trieste Observatory of Cardiovascular Diseases” was established in 2009 at the
Outpatient Clinic of the Cardiovascular Center and at the Cardiovascular
Department of the University Hospital of Trieste. The Observatory is based on
the integration at a regional level of administrative and clinical data sources.

The administrative source is the Regional Epidemiological Repository (RER) of the
Friuli-Venezia-Giulia region, which includes several databases: the Registry of
Births and Deaths, Hospital Discharge (SDO), laboratory tests performed in the
public hospital on in- or out-patients and reimbursable by the National Health System
(NHS), the Public Drug Distribution System database, and District Healthcare Services
(Intermediate and Home Care). The clinical source is the electronic chart (Electronic
Health Record, EHR, Cardionet®), a cardiological medical record that includes
medical information and history as collected by cardiologists during routine clinical
practice (both in ambulatory visits and during cardiological hospitalizations).
Diagnostic codes, laboratory tests, procedures (for example echocardiography and
electrocardiogram) and cardiological drug prescriptions are recorded at each visit.
Medical records are routinely reviewed by clinicians at each clinical evaluation to
update medical history, diagnostic procedures, and treatment. This integrated
database covers the Trieste population, i.e., 237,000 inhabitants. The RER is
implemented in a SAS system, and the e-chart Cardionet® has been fully integrated
in the same Data Warehouse via a dynamic single anonymous identification code.
Previously, data were extracted by means of ad-hoc queries in a Business Object
system, with the possibility of identifying patients. Of note, population-based research
on cardiovascular diseases is feasible in the area of Trieste because the public health
care system is largely dominant (87.1% of all cardiovascular ambulatory clinical
evaluations, based on administrative reports). To the best of our knowledge, this is one
of the first attempts in Italy at a systematic integration between administrative and
EHR systems at a regional level.
2 Applications 


We have focused our interest on evaluating the characteristics and outcomes of
“real-world” Heart Failure (HF) patients. HF prevalence steeply increases with aging,
from less than 1% in the population aged between 20 and 39 to more than 20% in
individuals over 80 [1, 2]. Population-based studies report that the one-year mortality
rate ranges from 35% to 40% [3, 4] and that more than 50% of patients are readmitted to
hospital between six and twelve months after the first diagnosis [5]. In this
epidemiological scenario, elders with HF are representative of the growing segment
of the population living longer with chronic health conditions and prone to multiple
transitions from hospital to home that negatively affect their quality of life and consume
substantial healthcare resources [6]. Moreover, HF is a clinically highly variable syndrome
that occurs across the entire range of left ventricular ejection fraction (LVEF), from
patients with preserved LVEF (HFpEF: LVEF ≥ 50%) to those with reduced LVEF
(HFrEF: LVEF < 50%). In line with the aging of the population, there is an increase in
concomitant noncardiac conditions affecting chronic HF patients. These
comorbidities frequently complicate management and may contribute to adverse
outcomes. Given this public health issue, there is an urgent need to rearrange HF
healthcare systems in order to improve evidence-based practice and create seamless
care systems. To this purpose, we recently conducted two studies using data from the
Observatory. In the first one, the aim was to explore the differential prevalence and
the attributable risk of noncardiac comorbidities on outcomes between HFrEF and
HFpEF patients in a large contemporary, community-based population.

In the second one, we investigated clinical factors that contribute to lengthening hospital
stays and to increasing multiple readmission rates to both hospitals and community
services, such as Integrated Home Care (IHC) activations and Intermediate Care Unit
(ICU) admissions.


2.1 Prevalence and Prognostic Impact of Noncardiac 
Comorbidities in Heart Failure Outpatients: selection of the 
cohort, covariates and outcomes definition 


The cohort of consecutive ambulatory HF patients who attended the Outpatient Clinic of the
Cardiovascular Center of Trieste between November 2009 and December 2013 was selected,
and the first visit in the period was considered as the starting point (index visit). To identify the
cohort, we first used the EHR, selecting diagnostic codes presenting clinical findings compatible
with HF. The diagnosis of HF was made using the criteria of the European Society of Cardiology
[7]. Patients were divided into two groups according to the LVEF: preserved (LVEF ≥ 50%)
and reduced (LVEF < 50%). Clinical variables, including cardiac and noncardiac
comorbidities, were determined according to the data of EHR integrated with diagnoses based 
on ICD-9CM derived from previous hospitalizations, laboratory data, and/or specific 
treatment of chronic illnesses. On the basis of the Charlson comorbidity index [8], we 
included the following noncardiac comorbidities: peripheral artery disease (PAD), 
cerebrovascular accident, dementia, chronic obstructive pulmonary disease (COPD),
rheumatological disorders, acquired immunodeficiency syndrome, peptic ulcer disease,
diabetes mellitus, liver disease, malignancy, chronic kidney disease (CKD), psychiatric
disorders, and anemia. According to Ather et al. [9], we also included obesity and
hypertension, because of their prognostic significance in HF patients. For each patient, the
total number of comorbidities was calculated. Body mass index was calculated as the ratio of
weight to squared height (kg/m²), and obesity was defined as a body mass index ≥ 30
kg/m². Hypertension was defined as a systolic blood pressure ≥ 140 mmHg and/or a
diastolic blood pressure ≥ 90 mmHg at the index visit, and/or a history of
hypertension. Renal failure was defined as an estimated glomerular filtration rate
(GFR) < 60 ml/min/1.73 m². Anemia was defined according to World Health Organization
criteria (Hb < 13 g/dL in men and < 12 g/dL in women). Study outcomes of interest included
death from any cause, all-cause hospitalization, HF hospitalization, and noncardiovascular 
hospitalization. Deaths were collected from the regional Registry of Birth and Deaths. First 
all-cause hospitalization, HF hospitalization, and noncardiovascular hospitalization were 
collected from the Hospital Discharge Registry. The principal discharge diagnosis for each 
hospitalization was assessed using primary ICD-9CM code, which is assigned by clinical 
personnel after discharge, and reflects the main reason for admission. The administrative
censoring date was December 31st, 2014.
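
The covariate definitions above translate directly into simple rules, illustrated by the following Python sketch. It is not the code used in the study: the function names are hypothetical, and boundary values for obesity, hypertension and preserved LVEF are treated as inclusive, which is an assumption about how the thresholds in the text are to be read.

```python
def body_mass_index(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def is_obese(weight_kg, height_m):
    return body_mass_index(weight_kg, height_m) >= 30


def is_hypertensive(sbp, dbp, history=False):
    """Blood pressure at the index visit (mmHg) and/or known history."""
    return sbp >= 140 or dbp >= 90 or history


def has_renal_failure(egfr):
    """Estimated GFR in ml/min/1.73 m^2."""
    return egfr < 60


def is_anemic(hb, sex):
    """Haemoglobin in g/dL; sex is 'M' or 'F' (WHO thresholds from the text)."""
    threshold = 13.0 if sex == "M" else 12.0
    return hb < threshold


def lvef_group(lvef):
    """Classify a patient by left ventricular ejection fraction (percent)."""
    return "HFpEF" if lvef >= 50 else "HFrEF"


# Hypothetical patient, for illustration only.
flags = [is_obese(90, 1.70), is_hypertensive(150, 80),
         has_renal_failure(72), is_anemic(12.5, "M")]
print(lvef_group(35), sum(flags))  # group and number of flagged comorbidities
```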


2.2 Prevalence and Prognostic Impact of Noncardiac 
Comorbidities in Heart Failure Outpatients: statistical methods 


To examine the relationship between noncardiac comorbidities and outcomes, we estimated
the population attributable fraction (PAF) of each noncardiac comorbidity, expressed as a
percentage, in the overall HF population and in the LVEF subgroups. The PAF is the proportion of
incidence that would not occur if the factor were eliminated [10]. The estimated PAF was reported
with the corresponding 95% confidence intervals (CIs). The unadjusted PAF in the exposed
group (PAFexp) was calculated using the following formula: PAFexp = (RR-1)/RR, where RR
is the relative risk of the event computed for the exposed group with respect to the unexposed
group. The unadjusted PAF in the overall population (PAFpop) was calculated using the
following formula: PAFpop = p*(RR-1)/(p*(RR-1)+1), where p is the prevalence of
exposure in the population. The corresponding adjusted estimates for both measures were
derived from a logistic regression model adjusted for age and sex. In order to assess the
interaction between LVEF groups and comorbidities (both individually, and as the number of
comorbidities per patient), hazard ratios of the interaction terms in Cox models adjusted for
sex and age were calculated. To examine the effect of comorbidity load on all-cause mortality,
the HFrEF and HFpEF populations were divided into groups with different comorbidity loads
(absence, 1, 2, ≥3 comorbidities); event curves for each comorbidity group were estimated
using the Kaplan-Meier method within each LVEF group. Covariates for multivariable
models of mortality were selected on the basis of a backward stepwise algorithm in a Cox
proportional hazards model. The model included demographics, medical history, laboratory
values, and the interaction between comorbidity load (absence, 1, 2, ≥3) and LVEF groups.
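
The two unadjusted PAF formulas can be checked numerically with a short Python sketch. The function names and the counts in the example are hypothetical and do not come from the study; the adjusted estimates, which rely on a regression model, are not covered.

```python
def relative_risk(events_exp, n_exp, events_unexp, n_unexp):
    """Relative risk of the event in the exposed vs. the unexposed group."""
    return (events_exp / n_exp) / (events_unexp / n_unexp)


def paf_exposed(rr):
    """Attributable fraction in the exposed group: (RR - 1) / RR."""
    return (rr - 1) / rr


def paf_population(p, rr):
    """Population attributable fraction: p(RR - 1) / (p(RR - 1) + 1),
    where p is the prevalence of exposure in the population."""
    return p * (rr - 1) / (p * (rr - 1) + 1)


# Hypothetical counts: 30 events among 100 exposed, 10 among 100 unexposed.
rr = relative_risk(30, 100, 10, 100)
print(rr, paf_exposed(rr), paf_population(0.25, rr))
```

With a relative risk of 2 and a prevalence of 25%, for example, half of the events among the exposed and one fifth of the events in the whole population would be attributable to the factor.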
2.3 Multi-state modelling of Heart Failure care path: selection of 
the cohort, covariates and outcomes definition 


Between 2009 and 2014, a cohort of patients hospitalized in the Trieste area and discharged
with an HF diagnosis was identified. HF diagnosis included ICD-9CM codes for HF (428.x) and
hypertensive HF (402.01, 402.11, 402.91) according to the National Outcome Evaluation
Program (in Italian PNE, Programma Nazionale Esiti) of the National Agency of
Regional Health-Care Services (in Italian AGENAS, AGEnzia NAzionale per i servizi
Sanitari regionali). Patients were classified as Worsening Heart Failure (WHF) or De Novo
on the basis of the presence of at least one HF hospitalization in the 5 years preceding the
index admission, which is the first admission during the study period. The administrative
censoring date was September 30th, 2015. For each cohort member, data included gender and 
age, length of stay, department of admission and discharge, diagnostic code at discharge, stay
in Emergency/Intensive Care Units during the hospitalization, cardiological evaluation
before the hospitalization (when performed), laboratory tests, LVEF (when performed). The 
Charlson Comorbidity Index was calculated using hospital diagnosis based on ICD-9CM that 
occurred within five years before the first admission and integrated with laboratory data and 
diagnosis recorded at the index admission. In particular, for the diagnosis of diabetes mellitus 
we integrated information about glycosylated haemoglobin at admission and the recorded 
diagnosis of diabetes mellitus in the previous 5 years. Similarly, to assess the presence of
chronic kidney disease, we combined an estimated glomerular filtration rate (eGFR) < 60 ml/min,
computed from the creatinine value at admission, with a reported diagnosis of
chronic kidney disease in the previous 5 years. Study outcomes of interest included death from
any cause, all-cause rehospitalization, and transitions in IHC/ICU. Deaths were collected
from the regional Registry of Birth and Deaths. All-cause hospitalizations and admissions in 
IHC/ICU were collected respectively from the Hospital Discharge Registry and the District 
Healthcare Services database. The principal discharge diagnosis for each hospitalization was 
assessed using primary ICD-9CM code. Each cohort member was followed from the starting 
date (i.e. discharge from the index admission) until the end of the study or the date of death. 
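
The cohort rules described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the string format of the codes and the handling of 29 February are hypothetical choices, not part of the study protocol.

```python
from datetime import date

# Hypertensive HF codes listed in the text; 428.x is matched by prefix
HYPERTENSIVE_HF_CODES = {"402.01", "402.11", "402.91"}


def is_hf_code(icd9cm: str) -> bool:
    """True for ICD-9CM heart failure codes 428.x or the hypertensive HF codes."""
    code = icd9cm.strip()
    return code.startswith("428.") or code in HYPERTENSIVE_HF_CODES


def classify_presentation(index_admission: date, prior_hf_admissions: list[date]) -> str:
    """WHF if at least one HF hospitalization falls in the 5 years before the index admission."""
    try:
        window_start = index_admission.replace(year=index_admission.year - 5)
    except ValueError:  # index admission on 29 February, target year not a leap year
        window_start = index_admission.replace(year=index_admission.year - 5, day=28)
    has_prior = any(window_start <= d < index_admission for d in prior_hf_admissions)
    return "WHF" if has_prior else "De Novo"
```

For example, an index admission on 1 June 2010 with a previous HF admission on 15 March 2007 would be labelled WHF under this sketch, while a previous admission in 2004 falls outside the window.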


2.4 Multi state modelling of Heart Failure care path: statistical 
methods 


The first multi-state model (hereafter Model 1) replicates a dynamic similar to the one
described in [11] for repeated hospitalizations only (community services are omitted in
this case), i.e. a multi-state model fitting a Cox-type regression for each transition. It provides
a convenient description of the admission-discharge dynamics, pointing out which covariates
act on which transitions and how they affect the relative risk as well as the risk (i.e. the
instantaneous probability) of moving from one state to another. This model accounts for
the patient-specific risk profile (distinguishing covariates acting on different transitions) as well
as clinical information. The second model (hereafter Model 2) is also a multi-state model,
in which patients are assumed to be in one of the following five states: in hospital, in ICU, in
IHC, out (of hospital, ICU or IHC) and dead. Through this model, we aim to detect
the impact of patient characteristics on the risk of moving among these states. Both
models include the adverse outcome of death as an absorbing state. The death of a patient is a
competing event with respect to all the other transitions.
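
To make the state structure of Model 2 concrete, the sketch below counts the observed transitions among the five states and enforces that death is absorbing. The state labels, the function name and the example histories are hypothetical; the sketch does not fit any regression model.

```python
from collections import Counter

STATES = ("hospital", "ICU", "IHC", "out", "dead")
ABSORBING = "dead"


def count_transitions(histories):
    """Count observed transitions (from_state, to_state) over patient state sequences."""
    counts = Counter()
    for path in histories:
        for current, nxt in zip(path, path[1:]):
            if current not in STATES or nxt not in STATES:
                raise ValueError(f"unknown state in transition {current!r} -> {nxt!r}")
            if current == ABSORBING:
                raise ValueError("death is absorbing: no transition may leave it")
            counts[(current, nxt)] += 1
    return counts


# Hypothetical state histories for three patients (not study data)
histories = [
    ["out", "hospital", "out", "hospital", "dead"],
    ["out", "IHC", "out"],
    ["out", "hospital", "ICU", "hospital", "out", "dead"],
]
counts = count_transitions(histories)
```

Each pair with a non-zero count corresponds to one transition for which a transition-specific hazard could be modelled.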
4 Results


4.1 Prevalence and Prognostic Impact of Noncardiac Comorbidities 
in Heart Failure Outpatients 


A total of 2772 patients met the pre-defined HF criteria during the study period. Of these, 209 
(13%) patients were excluded because quantitative LVEF had not been documented, and 98 
(4%) were excluded because of left side severe primary valvular disease. Thus, a total of 2314 
patients met study selection criteria. Of these, 1373 (59%) patients were identified as HFpEF
(i.e., LVEF ≥50%) and 941 (41%) patients were identified as HFrEF. Overall, mean age was
77±10 years, with a substantial proportion of female patients (43%) and significant background
prevalences of ischemic heart disease (46%), hypertension (80%), and atrial fibrillation
(54%). During a median follow-up of 31 [IQR 16-41] months, 472 (20%) patients died.
Overall, there was a high morbidity burden, with first hospitalizations from any cause in 1533 
pts (66%), hospitalizations for HF in 510 (22%), hospitalizations for noncardiovascular cause 
in 1422 (61%). Among all noncardiac comorbidities, anemia, CKD, COPD, diabetes mellitus, 
and PAD were all strongly associated with mortality in the overall HF population (adjusted 
HR [95% CI]: anemia=1.9 [1.5-2.4]; CKD=1.7 [1.3-2.1]; COPD=1.6 [1.3-1.9]; diabetes=1.4 
[1.2-1.7]). Similar findings were observed for all-cause hospitalization, noncardiovascular, 
and HF hospitalizations (data not shown). Considering PAF (PAFexp and PAFpop) for all- 
cause mortality, anemia, CKD, diabetes mellitus, COPD, showed the highest quantitative 
contribution (PAFpop [95% CI]: anemia=14% [10-17]; CKD=17% [11-23]; COPD=14% 
[11-16]; diabetes=11% [8-15]). All other noncardiac comorbidities showed a PAF below 
10%. Findings were similar for all-cause hospitalization, with the exception of PAD, which
showed a high contribution only for all-cause hospitalization. Within each LVEF group, the
noncardiac comorbidities showed a similar quantitative contribution (data not shown).
Concordantly, for all-cause mortality, noncardiac comorbidities had no significant 
interactions by LVEF, confirming no differences in their prognostic impact. This was 
confirmed to be similar for all-cause, HF, and noncardiovascular hospitalizations (data not 
shown). When we grouped HF patients according to comorbidity burden, the presence of ≥3
comorbidities was associated with an increased risk (HR 2.3, 95% CI: 2.1-3.5) of all-cause
mortality. This trend was similarly observed in both LVEF groups (p=0.81 for interaction). In
multivariable Cox models, an increasing number of noncardiac comorbidities was associated 
with a higher risk for all-cause mortality (HR 1.25; 95% CI 1.1-1.3), all-cause
hospitalization (HR 1.17; 95% CI 1.12-1.23), HF hospitalization (HR 1.28; 95% CI 1.19-1.38),
noncardiovascular hospitalization (HR 1.16; 95% CI 1.11-1.22). The
multivariable Cox model revealed no significant difference in mortality rates between LVEF 
groups (HR 0.95; 95% CI: 0.63 to 1.42). This trend was confirmed also for morbidity 
outcomes (data not shown). 


4.2 Multi state modelling of Heart Failure care path 


A total of 4904 patients hospitalized with primary HF diagnosis between 2009 and 2014 were 
identified. The mode of clinical presentation of patients was De Novo HF, 4129 patients 
(84%), and WHF, 775 patients (16%). Overall, the mean age was 81±10 years, with a substantial
proportion of female patients (55%) and significant background prevalences of noncardiac
comorbidities. 2923 (71% out of 4129) De Novo patients had a previous hospitalization for
any cause. Indeed, more than half of the cohort (61%) had a renal disease and the median
LVEF, when recorded, was 53% (30% with LVEF < 40%; 13% with LVEF 40-49%; 57%
with LVEF ≥50%). Comorbidity burden was high, with a median Charlson score of 2
(40% with Charlson score >3). The rate of admission to a cardiological ward (CW) was 23% at
the first hospitalization. The median follow-up was 26 months (IQR 11-48). In Model 1, a
significant effect of aging and of increasing comorbidity burden on the rehospitalization risk
was observed. Likewise, a relevant impact of WHF was observed on all readmission rates. No
significant role of gender emerged. The probability of being discharged from hospital, i.e.
of shortening the Length Of Stay (LOS), was inversely related to age and Charlson score.
Conversely, a direct relation with admission to a cardiological ward was observed. When we
considered the effect of covariates on the risk of mortality related to readmission,
hospitalization in a cardiological ward was protective up to the second hospitalization, while
this effect vanished thereafter. Moreover, aging and an increasing Charlson score were
still associated with in-hospital and out-of-hospital death through all readmissions, whereas almost no
adverse effect on death was observed for the clinical condition of WHF. In Model 2 we observed
that aging and a higher Charlson index increased the risk of being readmitted to hospital,
ICU and IHC. When we considered the effect of covariates on the time spent in the different states of
Model 2, aging was directly related to time spent in hospital, while it was inversely
related to time spent in IHC. Furthermore, we noticed a higher risk of admission to ICU for
female patients. The WHF condition increased the risk of being readmitted to hospital and
behaved as a protective factor for death outside hospital.


5 Conclusions and Perspectives 


In the first study, we confirm in a contemporary community-based population that 
noncardiac chronic illnesses confer significant risk for mortality and hospitalization 
in HF patients. For the first time, we demonstrate the effect of noncardiac 
comorbidities, by estimating associated attributable risks in an HF community setting 
within each LVEF phenotype. Remarkably, the adverse impact of noncardiac chronic 
diseases appears similarly significant, irrespective of LVEF groups. Of all individual 
noncardiac comorbidities, CKD, anemia, diabetes mellitus, COPD, and PAD showed 
the highest significant association with mortality and morbidity. 

In the second application, for the first time, we demonstrate the effect of certain
clinical conditions in a community setting on multiple readmissions by including
intermediate care states (IHC/ICU). These findings significantly enhance our
understanding of the clinical patterns associated with adverse prognosis in patients with HF and have
implications for the management approach to HF patients. From a policy
perspective, the identification of patients at high risk of multiple readmissions should
encourage the implementation of appropriate preventive intervention strategies
[12]. Future developments in data source integration will include the individual
linkage of socio-economic indicators from the ISTAT Census 2011 at a regional
level, and the development of indicators of persistence/adherence to the therapy
prescribed at the cardiological visits and after episodes of hospitalization. From an
epidemiological point of view, efforts will be directed at minimizing the various
sources of bias in the study design that commonly arise in observational studies
and are, moreover, a crucial point when using administrative data sources. From a
statistical point of view, methodological research comparing different approaches to
the analysis of longitudinal measures and their impact on time-to-event data,
considering recurrent/competing events, will be the area of interest [13,14,15].
Marginal modeling of multilateral relational
events 
Modelli marginali per eventi relazionali multilaterali 


Francesco Bartolucci, Stefano Peluso and Antonietta Mira 


Abstract We implement the methodology of marginal modeling of relational events 
involving groups of actors, as developed in [3]. Current relational data analyses 
suffer from the representation of an event through edge variables, with potential 
loss of information when the events generate a set of multiple relations rather than 
bilateral connections or ties. To fully exploit the informational content of relational 
events, we model an event as a binary vector of response variables representing 
actors participating in the event. Univariate and bivariate distributions of the events
are modeled through marginal parameters having a clear social interpretation. 
Abstract Implementiamo il metodo proposto da [3] per la modellizzazione marginale 
di eventi relazionali che coinvolgono gruppi di attori di diversa numerosità. Di- 
versamente dagli attuali metodi statistici disponibili in letteratura, nel modello 
marginale proposto, l’evento è rappresentato tramite un vettore binario che indica 
gli attori partecipanti o meno all’evento stesso. Presentiamo una parametrizzazione 
delle distribuzioni marginali univariate e bivariate facilmente interpretabile in ter- 
mini di comportamento individuale e collettivo degli attori. 
1 Introduction


In many relational contexts, a set of events is observed, with each event involving 
a group of actors; this gives rise to a set of multiple relations rather than to single 
bilateral connections. In these applications, the interest is not only in studying the
relation between units, taking into account that the same event may involve more
than two actors, but also in modeling the tendency of each actor to be involved in
different events, not excluding events involving a single actor.

Common strategies to analyze social networks originating from a sequence of
events are based on models for edge variables of type $Y_{ij}$ associated with each
possible pair of actors (i, j) in a network of n actors: $Y_{ij}$ is equal to 1 if there is a
connection between actors i and j and to 0 otherwise. Then, standard models for
cross-sectional social networks, such as exponential random graph models (ERGM)
or latent space/block models, may be fitted to draw conclusions (for a review see
[10]). When the edge variables are time or event specific, more sophisticated
strategies are based on models for network dynamics, of which at least three approaches
may be highlighted: actor-oriented models [11], dynamic ERGMs [9], and hidden 
Markov models [12, 2]. 

To fully exploit the information provided by each relational event, we propose an
approach based on the direct representation of the outcome of an event e by a vector
of response variables $Z^{(e)} = (Z_1^{(e)}, \ldots, Z_n^{(e)})$, with $Z_i^{(e)}$ equal to 1 if unit i is involved
in the event and 0 otherwise. Following [3], we then formulate a statistical model for
the response vectors $Z^{(e)}$, with parameters having meaningful interpretations from
a social behavior perspective. In particular, the distribution of $Z^{(e)}$ is parametrized
through a marginal model [1, 4] based on the specification of its univariate and
bivariate marginal distributions. The parameters involved in the univariate marginal
distributions represent the general tendency of an actor to be involved in an event.
The parameters involved in the second-order marginal distributions account for the
concordance between behaviors of two actors, that is, the tendency to be jointly
involved (or not) in the same event.

In our approach, we take advantage of the availability of a typically long series of
events to estimate in a reliable way individual parameters associated with the actors in
the network. In particular, we rely on a fixed-effects composite likelihood approach
[8], which is rather straightforward to use even with large networks.

We pay particular attention to the representation of the results. In this regard we
introduce parametrizations that allow us to represent trajectories of the tendency
to be involved in an event and of the concordance of behaviors with other units,
highlighting their evolution over time. Furthermore, we can express the association
parameters through the Euclidean distance between two actors, computed on suitably
estimated subject-specific latent vectors. Then, a “map” may be visualized, with
close units characterized, as in the latent space model of [7], by a high chance to be
tied.

2 Model Description and Inference


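In the present section we briefly outline the modeling framework introduced in [3].
Let n be the number of actors and r the number of events, with the binary variable
$Z_i^{(e)}$ and the binary vector $Z^{(e)}$ defined in Section 1. Also define the bivariate
probability vector

$$p_{ij}^{(e)} = \big( p(Z_i^{(e)}=0, Z_j^{(e)}=0),\; p(Z_i^{(e)}=0, Z_j^{(e)}=1),\; p(Z_i^{(e)}=1, Z_j^{(e)}=0),\; p(Z_i^{(e)}=1, Z_j^{(e)}=1) \big)'$$

for $i = 1,\ldots,n-1$, $j = i+1,\ldots,n$, and $e = 1,\ldots,r$, with $z_i^{(e)}$ denoting the observed
value of $Z_i^{(e)}$ and $I(\cdot)$ denoting the indicator function. A marginal parametrization
for the distribution of the random vector $Z^{(e)}$ is based on effects of the following
type, for a suitable series of subsets of response variables with index in A:

$$\eta_A^{(e)} = \sum_{B \subseteq A} (-1)^{|A \setminus B|} \log p\big(Z_B^{(e)} = 1,\; Z_{A \setminus B}^{(e)} = 0\big), \qquad (1)$$

where $Z_B^{(e)}$ denotes the subvector of $Z^{(e)}$ with index in B. In more detail, we specify
first- and second-order effects for all the individuals and events as

$$\eta_i^{(e)} = \log \frac{p(Z_i^{(e)}=1)}{p(Z_i^{(e)}=0)}, \qquad i = 1,\ldots,n, \qquad (2)$$

which is a particular case of (1) when A = {i}, and

$$\eta_{ij}^{(e)} = \log \frac{p(Z_i^{(e)}=0, Z_j^{(e)}=0)\; p(Z_i^{(e)}=1, Z_j^{(e)}=1)}{p(Z_i^{(e)}=0, Z_j^{(e)}=1)\; p(Z_i^{(e)}=1, Z_j^{(e)}=0)}, \qquad i = 1,\ldots,n-1,\; j = i+1,\ldots,n, \qquad (3)$$

which is obtained from (1) with A = {i, j}. As already mentioned in Section 1, the
marginal logit $\eta_i^{(e)}$ and the log-odds ratio $\eta_{ij}^{(e)}$ are interpreted, respectively, as a
measure of the tendency of subject i to be involved in event e and as a measure of the
concordance between the behaviors of subjects i and j to collaborate in event e.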
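
As a small numerical illustration of the marginal logit and the log-odds ratio described above, the following sketch computes both from the four joint probabilities of a pair of actors. The function name and argument order are assumptions made for this example only.

```python
import math


def marginal_effects(p00: float, p01: float, p10: float, p11: float):
    """Return (logit_i, logit_j, log_odds_ratio) for a bivariate binary distribution.

    pab is the joint probability that actor i takes value a and actor j takes value b.
    """
    probs = (p00, p01, p10, p11)
    if min(probs) <= 0.0 or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("need four strictly positive probabilities summing to 1")
    logit_i = math.log((p10 + p11) / (p00 + p01))   # tendency of actor i to be involved
    logit_j = math.log((p01 + p11) / (p00 + p10))   # tendency of actor j to be involved
    log_or = math.log((p00 * p11) / (p01 * p10))    # concordance between i and j
    return logit_i, logit_j, log_or
```

Under independence with equal marginals of 0.5 all three quantities are zero, whereas probabilities (0.4, 0.1, 0.1, 0.4) give zero logits and a log-odds ratio of log 16, i.e. strong concordance.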
Parameter estimation is performed through numerical maximization of the pairwise
composite log-likelihood function:

$$\ell(\theta) = \sum_{e=1}^{r} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log p\big(Z_i^{(e)} = z_i^{(e)},\; Z_j^{(e)} = z_j^{(e)}\big), \qquad (4)$$

where $\theta$ collects all individual parameter vectors and every probability vector $p_{ij}^{(e)}$,
whose elements enter (4), is obtained from suitable elements of $\theta$ by an explicit
formula elaborated by [6]. This numerical maximization exploits an expression for the
derivative of $p_{ij}^{(e)}$ with respect to $\theta$ that is rather easy to compute.
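
A pairwise composite log-likelihood of the kind described above can be evaluated as in the following sketch. It only computes the objective; the mapping from parameters to pairwise probabilities is left as a user-supplied callable, which is an assumption of this example (the explicit formula of [6] is not reproduced here), and all names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_composite_loglik(events, pair_probs):
    """Sum of log p(Z_i = z_i, Z_j = z_j) over all events and all pairs i < j.

    events: list of binary tuples, one per event, each of length n.
    pair_probs: callable (e, i, j) -> (p00, p01, p10, p11) for that event and pair.
    """
    total = 0.0
    for e, z in enumerate(events):
        for i, j in combinations(range(len(z)), 2):
            p = pair_probs(e, i, j)
            # index 0..3 selects p00, p01, p10, p11 according to the observed pair
            total += math.log(p[2 * z[i] + z[j]])
    return total


def uniform_pairs(e, i, j):
    """Hypothetical pair distribution: independence with marginals equal to 0.5."""
    return (0.25, 0.25, 0.25, 0.25)


# Hypothetical toy data: 3 actors, 2 events
events = [(1, 1, 0), (0, 1, 0)]
value = pairwise_composite_loglik(events, uniform_pairs)
```

In an actual fit, the callable would depend on the parameter vector and the returned value would be passed to a numerical optimizer.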

For large networks we also propose an estimation algorithm that explicitly groups 
individuals in separate clusters. Each cluster includes actors that are homogeneous
in terms of tendency to be involved in an event and tendency to be involved (or 
not) in the same event (concordance). This method is based on the classification 
version of the pairwise log-likelihood function defined above, formulated following 
the approach adopted by [5] in a simpler context. 


3 Application 


The approach is illustrated by some applications in which trajectories of the ten- 
dency to be involved in an event and concordance with other units are of interest, 
with special focus on the closeness between pairs of units and on its suitable depic- 
tion. 

In more detail, we apply our method to a temporal network of e-mail exchanges
between users affiliated with a large European research institution.¹ The network
was generated using anonymized information about all incoming and outgoing
e-mails between members of the research institution. The e-mails represent
communication between institution members only, and the dataset does not contain incoming
messages from or outgoing messages to the rest of the world.

¹ The data are freely available at https://snap.stanford.edu/data/email-Eu-core.html.

In our approach to analyzing these data, each e-mail is seen as a multilateral event
e involving a varying number of recipients, from which an appropriate realization
of the vector $Z^{(e)}$ is easily obtained by assigning value 1 to the elements corresponding
to the receivers, while all other elements are equal to 0. We also have four
sub-networks corresponding to the communication between members of four different
departments at the institution. The whole network counts 986 nodes, with 332,334
temporal edges (time-stamped directed edges), spanning 209,508 e-mails sent
over 803 days. The relevance of studying not only bilateral relations between
recipients is manifest in our application, since more than 18,000 e-mails are sent to more
than two addresses, and more than 170,000 e-mails have a unique address, with a
maximum of 39 receivers per e-mail. We report the diagram for the number of
recipients per e-mail in Figure 1. Subject heterogeneity is clear: the number of received
e-mails per subject ranges from 0 to 4,637, with mean and median equal,
respectively, to 332 and 138 e-mails, and standard deviation 478.6, clearly evidencing a
network composed of actors with very different social behaviors.
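
The encoding of an e-mail as a binary response vector, as described above, can be sketched as follows. The function name and the zero-based actor indexing are assumptions of this example.

```python
def event_vector(recipients, n_actors):
    """Binary vector with 1 for each actor who receives the e-mail, 0 elsewhere."""
    z = [0] * n_actors
    for actor in recipients:
        if not 0 <= actor < n_actors:
            raise IndexError(f"actor index {actor} outside 0..{n_actors - 1}")
        z[actor] = 1
    return z


# Hypothetical e-mail sent to actors 0, 3 and 4 in a network of 6 actors
z = event_vector([0, 3, 4], 6)
```

An e-mail with a single recipient simply yields a vector with one non-zero entry, so events of any size share the same representation.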

The application is based on two phases: fixed-effects estimation and clustering.
In the first phase, we assume polynomials of order 2 for the effect of time, and we
estimate the corresponding parameter vectors. This first result allows the derivation of
trajectories in terms of the tendency to be a recipient of an e-mail in a certain period.
Periods can be created arbitrarily, merging sending times falling in the same specified
period, or can be fixed to avoid the presence in the same period of e-mails involving
multiple ties among users, or can coincide with the actual times at which the e-mails are
sent. The other interpretation we can draw from the estimates of the fixed-effects
parameters is in terms of the tendency to be a recipient (or not) together with other users of the
same e-mails. We are able to represent these tendencies and their evolution over
time.
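
To illustrate how a second-order polynomial in time translates into a trajectory, the sketch below assumes that the marginal logit of an actor is a quadratic function of a rescaled time variable and maps it to a probability through the inverse logit. The exact parametrization used in the application is not given in the text, so the form of the predictor, the coefficients and the time grid are all hypothetical.

```python
import math


def tendency_trajectory(beta, times):
    """Probability of being a recipient at each time, from a quadratic marginal logit."""
    b0, b1, b2 = beta
    probs = []
    for t in times:
        eta = b0 + b1 * t + b2 * t * t               # marginal logit at time t
        probs.append(1.0 / (1.0 + math.exp(-eta)))   # inverse logit
    return probs


# Hypothetical coefficients and a rescaled time grid on [0, 1]
trajectory = tendency_trajectory((-2.0, 1.5, -0.5), [0.0, 0.5, 1.0])
```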

In the second phase of our procedure we cluster the users according to their
behaviors in terms of the tendency to receive e-mails and of the tendency to receive (or
not) common e-mails. The two clusterings do not necessarily coincide, since they
express two different features of the user: a user can be very active in receiving
e-mails but always involving the same restricted number of collaborators.


Fig. 1 Distribution of the number of recipients per e-mail
New Insights on Students Evaluation of
Teaching in Italy 


Nuove analisi sulle valutazioni degli studenti in Italia 


Bassi Francesca, Grilli Leonardo, Paccagnella Omar, Rampichini Carla and 
Varriale Roberta 


Abstract This work presents new analyses on the relationship between student eval- 
uation of teaching and student, teacher and course specific characteristics, exploiting 
the richness of information collected by a new survey carried out among professors 
of the University of Padua. Data collected in this survey are able to highlight teacher 
needs, beliefs and practices of teaching and learning. This makes it possible to introduce
some subjective traits of the teachers into the study. The role of these new variables in
explaining student evaluations is investigated in depth.

Abstract In questo lavoro vengono presentate delle nuove analisi sulla relazione 
fra le opinioni espresse dagli studenti per la valutazione della qualità della didattica
universitaria e caratteristiche specifiche del corso, degli studenti e dei docenti,
sfruttando la ricchezza di informazioni raccolte per mezzo di una nuova indagine
realizzata tra i docenti dell’Università di Padova. Questa indagine è in grado di
evidenziare i bisogni, le credenze e le pratiche dei docenti legate alle loro attività didat-
tiche, permettendo di introdurre nelle analisi un insieme di caratteristiche soggettive 
dei docenti. Il loro ruolo viene quindi approfonditamente studiato nelle successive 
analisi. 
1 Introduction


Students’ opinions and judgements of teaching performances play a substantial role 
in higher education, particularly as instruments for gathering information on the 
quality of education and evaluating university courses [1, 8]. The relationship
between student-, teacher- and course-specific characteristics and student evaluation of
teaching (SET) is the topic of a large body of literature (see the
extensive review provided by [6]). However, findings concerning the relationship
between SETs and the characteristics of courses, students and teachers are
sometimes contradictory. Moreover, these characteristics usually explain only a small portion
of the total variance in SET scores [5].

It is generally accepted that a multilevel analysis of the students’ ratings is a
satisfactory approach for investigating teaching evaluations, because of the hierarchical
nature of the data (i.e. university students nested within classes) [3].

This work aims at enriching the multilevel literature on the student evaluation of
teaching by proposing some original analyses based on a wider set of teacher-specific
characteristics, including teachers’ opinions on their teaching activities. This
work exploits an innovative and original dataset available at the University of Padua,
obtained by linking survey and administrative data coming from three different
sources: first, the conventional survey on the student evaluation of teaching carried
out among university students; second, administrative data on the main
features of the teachers and the didactic activities (DAs) they are involved in; third, a
new CAWI survey carried out within the research project PRODID (Teacher
professional development and academic educational innovation). The project started at the
University of Padua in 2013, with the aim of developing strategies to support
academic teachers and enhance their teaching competences. A specific questionnaire
was then developed and addressed to all professors involved in almost all didactic
activities of the University. This new survey collected opinions, beliefs and needs of
the professors with regard to the teaching activities carried out in their classes.

This work is organised as follows. Section 2 introduces the data of this analy- 
sis, while the empirical application (model specification and results) is described in 
Section 3. Section 4 ends the paper, highlighting the main conclusions and some 
suggestions for future works. 


2 The data 


This work investigates data obtained by merging three different datasets coming 
from the University of Padua. The reference is the 2012-2013 academic year. 

The first one is the standard online survey carried out by the University to
measure students’ opinions on the didactic activities. It involves all students who
attended lessons of any degree course of the University. Students were asked
to express their level of satisfaction on a scale from 1 to 10 (1 being the lowest level)
for a set of 18 items (seven if the student attended less than 30% of the lessons).




The second one is the administrative dataset that collects information on the 
teachers and the didactic activities of all Padua academic institutions. 

The third one is an innovative dataset, collected by means of a new online survey
aiming at providing a picture of the teaching experiences developed in the
university classrooms. Indeed, in 2013 the University of Padua promoted the PRODID
project (Teacher professional development and academic educational innovation;
in Italian, “Preparazione alla professionalità docente e innovazione didattica”) with the
purpose of developing an integrated system to improve teaching competences and
academic innovation. The PRODID project promoted a research-based approach to
creating training programs, faculty learning communities and pilot experimental
contexts where teaching innovation could be tested and monitored [2]. Following an
evidence-based approach, the project aimed at highlighting teachers’ needs, beliefs 
and practices of teaching and learning, which may constitute a privileged context 
for the development of innovative teaching activities within the institution. 

The final questionnaire was developed according to the Framework of Teaching
of [7] and was composed of three sections. The first section focuses on practices
developed by the Padua professors in their teaching activities. The teacher is regarded
as a facilitator of the learning processes and for this reason the section asks, for each
DA (at most three), about the application (or not) of some specific practices in his/her
activities. Eight items are collected. Six indicators are then constructed; five of
them are obtained by considering the first five items separately as dummy variables:
implementation of practices for actively involving students; proposal of external
contributions (i.e. stakeholders); monitoring of student learning during the course
by means of specific tests or other means; assessment of student learning using various
types of examination; modification of teaching practices according to SET. The
sixth indicator is calculated by summarising the last three items of the section in a
single dummy variable (reporting at least one activity involving technology practices),
since these three questions collect similar information on these practices. The second
section explores teachers’ beliefs about teaching in higher education. By means
of 20 questions, on a scale from 1 (fully disagree) to 7 (fully agree), some general
dimensions are investigated: the Person as Teacher, Expert on Content Knowledge,
Facilitator of Learning Processes and Scholar/Lifelong Learner. Considering also
some questionnaire validation analyses (a factor analysis in particular), six factors
are defined (they substantially replicate the aforementioned dimensions), calculated
as the average values of the answers within each factor. These factors may be
summarised as further subjective characteristics of the teachers: i) passion for teaching;
ii) passion for research; iii) feeling the need of support for improving teaching
activities; iv) will to change teaching activities according to students’ needs; v) features
of teaching and learning methods; vi) features of teaching and evaluation
activities. The third section focuses on teachers’ needs, which are collected through some
open-ended questions (however, they are not exploited in this analysis).
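
The construction of the sixth practice indicator and of the belief factors described above can be sketched as follows. Function names and the encoding of the items (0/1 for practices, integers from 1 to 7 for beliefs) are assumptions of this example; the assignment of questions to factors is not specified in the text and is therefore left to the caller.

```python
from statistics import mean


def technology_indicator(last_three_items):
    """1 if at least one of the three technology-related practices is reported, else 0."""
    return int(any(last_three_items))


def factor_score(answers):
    """Average of the 1-7 agreement answers belonging to one factor."""
    if any(not 1 <= a <= 7 for a in answers):
        raise ValueError("answers must lie on the 1-7 scale")
    return mean(answers)
```

For instance, a teacher reporting only the third technology practice gets an indicator equal to 1, and answers 5, 6 and 7 on the questions of one factor give a score of 6.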

The PRODID questionnaire was addressed to all teaching staff of the University 
of Padua involved in any DA during the academic year 2012-2013; the response rate 
of this survey was slightly lower than 50%. 
In this analysis we consider only students who attended at least 50% of lessons,
involved in bachelor degree courses and enrolled in any undergraduate
programme except Medicine. Finally, we excluded courses with a number of units
smaller than five (in order to avoid comparisons based on too few ratings).
According to these criteria, the linkage of the different sources led to a final dataset
composed of 23605 complete records, based on students’ evaluations.
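
Two of the selection rules above (attendance of at least 50% and at least five ratings per group) can be sketched as a simple filter. The record layout and the field names `attendance` and `unit_id` are hypothetical, and the restrictions on degree type and on Medicine are omitted for brevity.

```python
from collections import Counter


def select_records(records, min_attendance=0.5, min_ratings=5):
    """Keep ratings from students attending at least half of the lessons,
    then drop groups left with fewer than `min_ratings` ratings."""
    attended = [r for r in records if r["attendance"] >= min_attendance]
    per_unit = Counter(r["unit_id"] for r in attended)
    return [r for r in attended if per_unit[r["unit_id"]] >= min_ratings]
```

The order of the two steps matters: the size threshold is applied after the attendance filter, so that groups are judged on the ratings actually used in the analysis. This ordering is a choice of the sketch, not a detail stated in the text.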


3 The analysis 


The analysis of the dataset described in the previous Section is based on the es- 
timation of a multilevel random intercept model [4], where the level-1 dependent 
variable is the overall level of satisfaction (based on Item 14). Level-2 units are the 
DAs of each teacher. This choice follows from the fact that, within each course, the 
student is asked to evaluate the activities of each professor having a minimum num- 
ber of hours taught in the course. The student degree is not a further level, but it is 
controlled by means of fixed effects. The total number of level-2 units is equal to 
590, while 40 is the average number of observations per group. 
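
For reference, a random intercept model of this kind is commonly written as below. This is a generic linear formulation given as an assumption: the text does not state the exact functional or distributional form used for the 1-10 rating, and the symbols are introduced here for illustration only.

```latex
% Generic two-level random intercept model (illustrative notation, not taken from the paper)
% i indexes students (level 1), j indexes teacher-DA units (level 2)
y_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + \mathbf{w}_{j}^{\top}\boldsymbol{\gamma} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)
```

Here $y_{ij}$ would be the overall satisfaction rating, $\mathbf{x}_{ij}$ the student-level covariates, $\mathbf{w}_j$ the course- and teacher-level covariates, and $u_j$ the random intercept shared by all ratings of the same level-2 unit.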

In general, the rating of a student to a given item for a certain course may de- 
pend on course-related factors (class size and heterogeneity, course difficulty and so 
on), student-related factors (gender, age and so on) and teacher-related factors (age, 
gender, personal traits and so on) [6]. According to the aims of this work and the 
features of our dataset, the set of our explanatory variables may be divided into:


• Course characteristics: compulsory course, total number of hours, more than one
  teacher involved, location (in Padua or outside), shared course.
• Student - general characteristics: gender, age.
• Student - university career: year of enrolment, average (per year) number of
  passed exams, average grade of the exams in the referred academic year.
• Teacher - general characteristics: gender, age.
• Teacher - university career: academic position.
• Teacher - DA characteristics: proportion of the total number of hours within DA.
• Teacher - subjective characteristics: according to Section 2, the six indicators of
  teaching practices and the six factors of teacher beliefs.


This specification makes it possible to investigate, in particular, the respective roles
of objective and subjective teacher characteristics.


3.1 Main results 


Results from the estimation of the random intercept model described in the previous
Section are reported in Table 1.

On the one hand, student characteristics are strongly related to the overall
satisfaction rating of the DA, particularly those related to the academic experience of
these students. The main features of the courses, instead, play a weak role.

On the other hand, there are some interesting results on the relationship between
SETs and teacher characteristics. Objective teacher traits are weakly related to
SET ratings: age is the only variable with a strongly statistically significant
estimate (the older, the better the teacher is evaluated, ceteris paribus). Subjective
features of the teachers are also related to SET scores, but in specific ways. Two
indicators of practices and as many as four factors of beliefs are statistically
significant. In particular, looking at these teacher beliefs, interesting relationships
appear for those factors related to the sensitivity to and the aptitude for teaching.
For instance, according to the PRODID questionnaire, the factor “Feeling the need of
support to improve teaching activities” may highlight those teachers who experience
some difficulties or inadequacies in their teaching activities or performance and who
therefore need help from experts. Students are able to perceive such difficulties and
then report a lower evaluation of the course (other things being equal). On the
contrary, students recognise those teachers with a high passion for teaching or the
will to propose suitable and helpful instruments in their DAs: such traits may enhance
the transmission of knowledge from the teacher to the student.

It is worth noting the different relationships that come to light between SET 
evaluations and the passion for teaching and passion for research dimensions. 


4 Conclusions 


Exploiting the richness of information provided by an innovative survey on teaching
experiences and beliefs of professors working at the University of Padua, we
investigate in depth the role of teachers' perceptions and needs concerning their
activities. Findings clearly show that subjective characteristics of the teachers play
an important role in explaining SET ratings. However, the analysis should be improved
by taking into account the fact that the sample of professors who completed the PRODID
questionnaire is unlikely to have been randomly selected.

This work may be seen as a first step for enhancing the relationship between 
quality of a course (or university) and students’ opinions. Indeed, teaching is a com- 
plex and multidimensional concept, so a future research strand could be the analysis 
of a multidimensional indicator of course quality, based on a battery of items.

Table 1 Estimates of the random intercept model on the students’ overall satisfaction


Group characteristics   Variable                                                      Point estimate
Course                  Compulsory course                                             -0.036
                        Number of hours                                                0.432 *
                        More than one teacher                                         -0.096
                        Location of courses in Padua                                  -0.875 *
                        Shared course                                                 -0.126
Student - general       Female                                                        -0.030
                        Age                                                            0.304 ***
Student - career        Second year of enrolment                                      -0.216 ***
                        Third year of enrolment                                       -0.140 *
                        Average number of passed exams (whole career)                  0.088 **
                        Average grade of passed exams (in 2012/13)                     0.338 ***
Teacher - general       Female                                                        -0.169 *
                        Age                                                           -0.185 ***
Teacher - career        Full professor                                                 0.017
                        Associate professor                                            0.077
Teacher - DA            Proportion of hours in DA                                      0.231
Teacher - subjective    Practices for actively getting involved students              -0.110
(practices)             Proposal of external contributions                             0.192 **
                        Monitoring students learning during the course                 0.003
                        Assessing students learning using different types of exam     -0.194 **
                        Modification of teaching practices according to SET           -0.038
                        Reporting at least 1 activity involving technology practices   0.053
Teacher - subjective    Passion for teaching                                           0.128 ***
(beliefs)               Passion for research                                          -0.049
                        Need of support for improving teaching activities             -0.110 ***
                        Will of changing teaching activities with students needs       0.075 *
                        Features of teaching and learning methods                      0.137 ***
                        Features of teaching and evaluation activities                -0.007
constant                                                                               6.152 ***
ICC                                                                                    21.2%


Note: *** significant at the 1% level; ** at the 5% level; * at the 10% level.

Bayesian Quantile Regression using the Skew
Exponential Power Distribution


Mauro Bernardi, Marco Bottone and Lea Petrella


Abstract Traditional Bayesian quantile regression relies on the Asymmetric Laplace
distribution (ALD), primarily because of its satisfactory empirical and theoretical
performance. However, the ALD displays medium tails and is not suitable for data
characterized by strong deviations from the Gaussian hypothesis. In this paper, we
propose an extension of the ALD Bayesian quantile regression framework to
account for fat tails using the Skew Exponential Power (SEP) distribution. Linear and
Additive Models (AM) with penalized splines are used to show the flexibility of the
SEP in the Bayesian quantile regression context. Lasso priors are used to shrink the
parameters when the parameter space becomes large. We propose a new adaptive
Metropolis-Hastings algorithm for the linear model, and an adaptive Metropolis within
Gibbs algorithm for the AM framework. Empirical evidence of the statistical properties
of the model is provided through several examples based on both simulated and real
datasets.

Abstract Bayesian analysis for quantile regression relies on the Asymmetric Laplace
distribution as an inferential tool. Although this distribution performs
satisfactorily, it behaves poorly when the phenomenon under study has tails that
differ from the Gaussian ones. In this paper, to account for heavy tails in the
phenomenon, we propose the use of the Skew Exponential Power (SEP) distribution in a
quantile regression context. We consider linear models and additive models based on
splines to carry out Bayesian inference. A lasso prior distribution on the model
parameters is proposed to address the problem of shrinking their number when the
parameter space becomes large. To carry out Bayesian inference, a new adaptive
Markov Chain Monte Carlo algorithm is proposed, and simulation analyses are presented
to validate the model.


Key words: Bayesian quantile regression; Skew Exponential Power; Additive Model. 


1 Introduction 


Quantile regression has become a very popular approach to provide a more complete
description of the distribution of a response variable conditionally on a set of
regressors. Since the seminal work of [1], several papers have been proposed in the
literature considering quantile regression analysis from both a frequentist and a
Bayesian point of view. Specifically, let $Y = (Y_1, Y_2, \dots, Y_T)$ be a random
sample of $T$ observations, and let $X_t = (1, X_{t,1}, \dots, X_{t,p-1})'$, with
$t = 1,2,\dots,T$, be the associated set of $p$ covariates. Consider the following
linear quantile regression model

$$Y_t = X_t'\beta_\tau + \epsilon_t, \qquad t = 1,2,\dots,T,$$

where $\beta_\tau = (\beta_{\tau,0}, \beta_{\tau,1}, \dots, \beta_{\tau,p-1})'$ is the
vector of $p$ unknown regression parameters, varying with the quantile level $\tau$.
As usual, $\epsilon_t$ represents the error term that, in the specific case of quantile
regression, has the $\tau$-th quantile equal to zero and constant variance. This
assumption allows us to interpret the regression line as the $\tau$-th conditional
quantile of $Y$ given the set of explanatory variables $X = x$, i.e.
$Q_\tau(Y \mid X = x) = x'\beta_\tau$. In what follows we omit the subscript $\tau$ for
simplicity. The estimation procedure of the $\tau$-th regression quantile in the
frequentist approach is based on the minimization of the following loss

$$\min_{\beta} \sum_{t=1}^{T} \rho_\tau\left(y_t - x_t'\beta\right)$$

with $\rho_\tau(u) = u\left(\tau - I(u<0)\right)$. From a Bayesian point of view, [8]
introduces the ALD as the likelihood function to perform the inference. For a wide and
recent Bayesian literature on quantile regression and the ALD see, for example, [7]
and [3]. Although the ALD is widely used in the Bayesian framework, it displays medium
tails, which may give misleading information for extreme quantiles, in particular when
the data are characterized by the presence of outliers and heavy tails. The absence in
the ALD of a parameter governing the tail fatness may influence the final inference.
To overcome this drawback we propose an extension of Bayesian quantile regression
using the Skew Exponential Power (SEP) distribution proposed by [2]. The SEP
distribution, like the ALD, has the property of having the $\tau$-level quantile as the
natural location parameter, but it also has an additional parameter governing the decay
of the tails. Using the proposed distribution in quantile regression, we are able to
robustify the inference, in particular when outliers or extreme values are observed.
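
As a small illustration of the check loss used in the frequentist approach, the
Python sketch below minimises the total loss over a constant $q$ for a toy sample,
which returns a sample quantile. The data and the helper names (`check_loss`,
`total_loss`, `best_constant`) are assumptions made for the example only; a regression
version would replace the constant $q$ by the linear predictor $x_t'\beta$.

```python
def check_loss(u, tau):
    """Check (pinball) loss: rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(q, ys, tau):
    """Sum of the check losses of the residuals y - q."""
    return sum(check_loss(y - q, tau) for y in ys)


def best_constant(ys, tau):
    """Minimise the total check loss over the observed values.

    The objective is piecewise linear and convex in q, so a minimiser is
    always attained at one of the data points.
    """
    return min(sorted(ys), key=lambda q: total_loss(q, ys, tau))


# Toy sample: the integers 1..11 in arbitrary order
ys = [2.0, 7.0, 1.0, 9.0, 4.0, 11.0, 3.0, 8.0, 6.0, 5.0, 10.0]
q10 = best_constant(ys, 0.10)
q50 = best_constant(ys, 0.50)
q90 = best_constant(ys, 0.90)
print(q10, q50, q90)  # 2.0 6.0 10.0
```

With 11 observations the minimisers are the 2nd, 6th and 10th order statistics, i.e.
the sample quantiles at levels 0.10, 0.50 and 0.90.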
When dealing with model building, the choice of appropriate predictors, and
consequently the variable selection issue, plays an important role. In this paper, we
approach this problem by considering the Bayesian version of the Lasso penalization
methodology introduced by [6], both for the simple linear regression quantile and for
the non-linear additive models (AM) with Penalized Spline (P-Spline) functions. To
implement the Bayesian inference we propose a new adaptive Metropolis-Hastings
algorithm for the linear model, and an adaptive Metropolis within Gibbs algorithm for
the AM framework, for an efficient estimate of the penalization parameter and the
P-Spline coefficients. We show the robust performance of the model with simulation
studies.


2 Model and Inference 


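In their paper [2], the authors propose a parametrization of the SEP that allows the
location parameter to be interpreted as the $\tau$-level quantile. With their
parametrization the SEP density function can be written as:

$$
f(y; \mu, \sigma, \alpha, \tau) =
\begin{cases}
\dfrac{1}{\sigma}\,\kappa(\alpha)\exp\left\{-\dfrac{1}{\alpha}\left(\dfrac{\mu - y}{2\tau\sigma}\right)^{\alpha}\right\}, & \text{if } y \le \mu, \\[2ex]
\dfrac{1}{\sigma}\,\kappa(\alpha)\exp\left\{-\dfrac{1}{\alpha}\left(\dfrac{y - \mu}{2(1-\tau)\sigma}\right)^{\alpha}\right\}, & \text{if } y > \mu,
\end{cases}
\tag{1}
$$

where $y \in \mathbb{R}$, $\mu \in \mathbb{R}$ is the location parameter,
$\sigma \in \mathbb{R}^+$ and $\alpha \in (0,\infty)$ are the scale and shape
parameters, respectively, $\tau \in (0,1)$ is the skewness parameter, while
$\kappa(\alpha) = \left[2\alpha^{1/\alpha}\,\Gamma\left(1 + \frac{1}{\alpha}\right)\right]^{-1}$
and $\Gamma(\cdot)$ is the complete gamma function. It can be shown that $\mu$ is the
$\tau$-th quantile and that the ALD is a particular case with $\alpha = 1$. Several
model specifications can be obtained using the SEP likelihood by specifying a given
function for the location parameter.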
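
The two properties just stated can be checked numerically. The Python sketch below
assumes that the SEP density has the piecewise form written in the docstring of
`sep_pdf` (a reading of the parametrization of [2], not a verified transcription) and
uses a plain trapezoidal rule. The parameter values and the function names are
illustrative assumptions.

```python
import math


def sep_pdf(y, mu, sigma, alpha, tau):
    """Assumed SEP density:
       (kappa/sigma) * exp(-(1/alpha) * ((mu - y) / (2*tau*sigma))**alpha)      if y <= mu
       (kappa/sigma) * exp(-(1/alpha) * ((y - mu) / (2*(1-tau)*sigma))**alpha)  if y >  mu
       with kappa = 1 / (2 * alpha**(1/alpha) * Gamma(1 + 1/alpha)).
    """
    kappa = 1.0 / (2.0 * alpha ** (1.0 / alpha) * math.gamma(1.0 + 1.0 / alpha))
    if y <= mu:
        z = (mu - y) / (2.0 * tau * sigma)
    else:
        z = (y - mu) / (2.0 * (1.0 - tau) * sigma)
    return kappa / sigma * math.exp(-(z ** alpha) / alpha)


def ald_pdf(y, mu, sigma, tau):
    """Asymmetric Laplace density written with scale s = 2*tau*(1-tau)*sigma."""
    s = 2.0 * tau * (1.0 - tau) * sigma
    u = (y - mu) / s
    rho = u * (tau - (1.0 if u < 0 else 0.0))  # check loss
    return tau * (1.0 - tau) / s * math.exp(-rho)


def integrate(f, a, b, n=20000):
    """Composite trapezoidal rule on [a, b] with n sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h


mu, sigma, tau = 1.0, 1.5, 0.25   # illustrative values
results = {}
for alpha in (0.8, 1.0, 2.0):
    f = lambda y: sep_pdf(y, mu, sigma, alpha, tau)
    below = integrate(f, mu - 200.0, mu)   # approximates P(Y <= mu)
    above = integrate(f, mu, mu + 200.0)   # approximates P(Y >  mu)
    results[alpha] = (below, below + above)
    print(alpha, round(below, 4), round(below + above, 4))
```

For each value of the shape parameter the mass below $\mu$ should be close to $\tau$ and
the total mass close to one; for $\alpha = 1$ the function `sep_pdf` coincides with
`ald_pdf`.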

In this paper we consider both the linear quantile regression framework

$$\mu = \mu(x_t) = x_t'\beta \tag{2}$$

where $x_t$ is a set of exogenous covariates, and Additive Models within a robust
semi-parametric regression framework:

$$\mu = \mu(x_t, z_t) = x_t'\beta + \sum_{j=1}^{J} f_j(z_{t,j})$$

where $x_t'\beta$ is the parametric component, $z_t = (z_{t,1}, z_{t,2}, \dots, z_{t,J})$
is an additional set of covariates and each $f_j(z_{t,j})$ is a nonparametric
continuous smooth function. To implement the Bayesian analysis we assume that
$f_j(z_{t,j})$ can be approximated using a polynomial spline of order $d$, with $k + 1$
equally spaced knots.

We now consider the linear case in more detail, where the likelihood function can be
easily computed starting from (1) by using $\mu$ as in (2).
The Bayesian inferential procedure requires the specification of the prior
distribution for the unknown vector of parameters $\Xi = (\beta, \gamma, \sigma, \alpha)$.
Here, in order to account for sparsity within the quantile regression model, we
generalize the prior proposed in Park and Casella for the $\beta$ parameter, assuming
the hierarchical structure given below. The prior distribution is given by:

$$\pi(\Xi) = \pi(\beta \mid \gamma)\,\pi(\gamma)\,\pi(\sigma)\,\pi(\alpha),$$

with

$$\pi(\beta \mid \gamma) = \prod_{j=1}^{p} L_1(\beta_j \mid 0, \gamma_j),$$
$$\pi(\gamma) = \prod_{j=1}^{p} \mathcal{G}(\gamma_j \mid \psi, \omega),$$
$$\pi(\sigma) \propto \mathcal{IG}(a, b),$$
$$\pi(\alpha) \propto \mathcal{B}(c, d)\,1_{(0,2)}(\alpha),$$

where $\beta \in \mathbb{R}^p$. Here $(\psi, \omega, a, b, c, d)$ are given positive
hyperparameters and $\gamma = (\gamma_1, \gamma_2, \dots, \gamma_p)$ are the parameters
of the univariate Laplace distribution:

$$L_1(\beta_j \mid 0, \gamma_j) = \frac{\gamma_j}{2}\exp\left\{-\gamma_j|\beta_j|\right\} 1_{(-\infty,+\infty)}(\beta_j),$$

with zero location and $\gamma_j$ scale parameter. Here $\mathcal{G}$, $\mathcal{IG}$
and $\mathcal{B}$ denote the Gamma, Inverse Gamma and Beta distributions, respectively.
Given its characteristics, the Laplace distribution is the Bayesian counterpart of the
Lasso penalization methodology introduced by [6] to achieve sparsity within the
classical regression framework. By shrinking each regression parameter in a different
way, we overcome problems that may arise in the presence of regressors with different
scales of measurement. The Bayesian inference is performed by building an Adaptive
Independent Metropolis-Hastings MCMC algorithm using the location-scale mixture
representation of the Laplace distribution, see for example [9].
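
A mixture representation of the Laplace prior can be illustrated with a short Python
sketch. It uses one common form, a scale mixture of normals with an exponential mixing
variance: if $s \sim \mathrm{Exp}(\gamma^2/2)$ (rate) and $\beta \mid s \sim N(0, s)$,
then $\beta$ has density $(\gamma/2)\exp\{-\gamma|\beta|\}$. This form is an assumption
made for illustration and may differ in detail from the representation used in [9];
the value of $\gamma$ and the function name are hypothetical.

```python
import math
import random


def draw_laplace_via_mixture(gamma, rng):
    """Draw beta ~ Laplace(0, scale 1/gamma) as a scale mixture of normals."""
    s = rng.expovariate(gamma ** 2 / 2.0)   # mixing variance, Exp with rate gamma^2 / 2
    return rng.gauss(0.0, math.sqrt(s))     # conditionally Gaussian draw


rng = random.Random(123)
gamma = 2.0                                  # hypothetical value
draws = [draw_laplace_via_mixture(gamma, rng) for _ in range(200_000)]

mean_abs = sum(abs(b) for b in draws) / len(draws)         # theory: 1 / gamma
tail = sum(abs(b) > 1.0 for b in draws) / len(draws)       # theory: exp(-gamma)
print(round(mean_abs, 3), round(tail, 3))
```

Conditionally on the mixing variables the prior is Gaussian, which is what makes
this kind of representation convenient inside an MCMC scheme.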


3 Simulation Studies 


We have performed several simulation studies to highlight the improvements of our
model specification with respect to the well-known ALD model. In particular,
the first simulation experiment is built to show the robustness properties
of the proposed methodology for quantile estimation when the joint distribution of
the couple $(Y_t, X_t)$, for $t = 1,2,\dots,T$, is contaminated by the presence of
outliers. The second study shows the effectiveness of the shrinkage effect, obtained
by imposing the Lasso-type prior, when the multiple quantile linear model is of key
concern. The last experiment aims at highlighting the ability of the model to adapt
to non-linear shapes when data come from heterogeneous fat-tailed distributions.
All of the simulation studies showed an improvement in the performance of the model
proposed in this paper with respect to the ALD quantile regression commonly used
in the literature. Here we present only the second experimental study. In particular,
we carry out a Monte Carlo simulation study specifically tailored to evaluate the
performance of the model when the Lasso prior is considered for the regression
parameters. The simulations are similar to those proposed in [4] and [5]. In
particular, we simulate $T = 200$ observations from the linear model
$Y_t = X_t'\beta + \epsilon_t$, where the true values of the regression parameters are
set as follows:


Simulation 1. $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)'$,
Simulation 2. $\beta = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)'$,
Simulation 3. $\beta = (5, 0, 0, 0, 0, 0, 0, 0)'$.


The first simulation corresponds to a sparse regression case, the second to a dense
case, and the third to a very sparse case. The covariates are independently generated
from a $N(0, \Sigma)$ with $\sigma_{i,j} = 0.5^{|i-j|}$. Two different distributions
for the error term generating process are considered for each simulation study. The
first is a Gaussian distribution $N(\mu, \sigma^2)$, with $\mu$ set so that the
$\tau$-th quantile is 0, while $\sigma^2$ is set to 9, as in [4]. The second
distribution is a Generalized Student t with parameters $(\mu, \sigma^2, \nu)$ and two
degrees of freedom, i.e. $\nu = 2$, $\sigma^2 = 9$ and $\mu$ set so that the $\tau$-th
quantile is 0. For three different quantile levels, $\tau = (0.10, 0.5, 0.9)$, we run
50 simulations for each vector of parameters ($\beta$) and each distribution of the
error term. Table 1 reports the median of mean absolute deviation (MMAD), i.e.
$\mathrm{median}\left(\frac{1}{T}\sum_{t=1}^{T}\left|x_t'\hat\beta - x_t'\beta\right|\right)$,
and the median of the parameters $\beta$, over 50 estimates. Results for the first
simulation are reported, since results from the other two simulations are
qualitatively similar. The proposed Bayesian quantile regression method based on the
SEP likelihood performs better in terms of MMAD for both distributions of the error
term. This is evidence that the presence of the shape parameter $\alpha$ in the
likelihood better captures the behavior of the data. The estimated shape parameter is
indeed greater and lower than one in the Gaussian and Generalized Student t cases,
respectively; this provides a more reliable estimation of the vector $\beta$,
regardless of the tail weight of the error term distribution. These results are
reinforced in the second and third simulation (not reported here), in which we
exaggerate the density and the sparsity of the predictor structure. Furthermore, the
proposed robust method reduces the bias of the estimated $\beta$ for all quantile
confidence levels. Regarding the shrinkage ability of the proposed estimator, when the
true parameters are zero, the SEP distribution performs better than the ALD in
identifying the parameters.
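
The data-generating design and the MMAD criterion described above can be sketched in
Python as follows. The sketch covers the Gaussian error case of Simulation 1 only and
does not implement the SEP or ALD samplers: the "estimates" are placeholders (the true
vector plus small noise) standing in for the output of an estimation routine. The
function names, the seed and the noise level are assumptions made for the example.

```python
import random
from statistics import NormalDist, median


def draw_covariates(p, rng, rho=0.5):
    """One covariate row with Cov(x_i, x_j) = rho**|i-j|, built by an AR(1) recursion."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        x.append(rho * x[-1] + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0))
    return x


def simulate(beta, tau, T, rng, sigma=3.0):
    """Linear model with N(mu, sigma^2) errors, mu chosen so the tau-th quantile is 0."""
    mu = -sigma * NormalDist().inv_cdf(tau)
    X = [draw_covariates(len(beta), rng) for _ in range(T)]
    eps = [rng.gauss(mu, sigma) for _ in range(T)]
    y = [sum(b * v for b, v in zip(beta, row)) + e for row, e in zip(X, eps)]
    return X, y, eps


def mean_abs_dev(X, beta_hat, beta_true):
    """Mean absolute deviation between fitted and true linear predictors."""
    def fit(row, b):
        return sum(bj * v for bj, v in zip(b, row))
    return sum(abs(fit(r, beta_hat) - fit(r, beta_true)) for r in X) / len(X)


rng = random.Random(1)
beta_true = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]   # Simulation 1
tau = 0.10

mads = []
for _ in range(50):                                     # 50 replications
    X, y, eps = simulate(beta_true, tau, 200, rng)
    # Placeholder for an estimate of beta (NOT an actual quantile regression fit)
    beta_hat = [b + rng.gauss(0.0, 0.1) for b in beta_true]
    mads.append(mean_abs_dev(X, beta_hat, beta_true))
mmad = median(mads)                                     # median of the 50 MADs

# Check of the error design on a larger sample: share of errors below zero ~ tau
_, _, big_eps = simulate(beta_true, tau, 20000, rng)
share_below_zero = sum(e <= 0.0 for e in big_eps) / len(big_eps)
print(round(mmad, 3), round(share_below_zero, 3))
```

In an actual study `beta_hat` would be replaced by the posterior summary returned by
the sampler for each replication, for each quantile level and error distribution.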
                                            ALD                           SEP
Error distribution      Par.    τ=0.10   τ=0.50   τ=0.90     τ=0.10   τ=0.50   τ=0.90

Gaussian                MMAD    1.0131   1.1008   1.0579     0.9096   1.0955   0.9708
                        β1      3.1323   3.2209   3.2145     3.0744   3.0036   3.2127
                        β2      1.6408   1.4786   1.6165     1.7656   1.4833   1.6800
                        β3      0.0444   0.0294   0.0267     0.0428   0.0228   0.0186
                        β4      0.0453   0.0243   0.0235     0.0248   0.0191   0.0156
                        β5      1.2731   1.2379   1.3471     1.3969   1.8405   1.4702
                        β6      0.0185   0.0161   0.0205     0.0124   0.0127   0.0128
                        β7      0.0112   0.0106   0.0120     0.0067   0.0063   0.0095
                        β8      0.0073   0.0078   0.0064     0.0038   0.0047   0.0051

Generalized Student t   MMAD    0.5163   0.1807   0.4685     0.4777   0.1789   0.4275
                        β1      3.0630   2.9884   2.9874     3.0826   2.9877   2.9934
                        β2      1.0484   1.3700   1.1366     1.0952   1.3951   1.2110
                        β3      0.0304   0.0144   0.0325     0.0252   0.0135   0.0412
                        β4      0.0258   0.0181   0.0162     0.0263   0.0163   0.0138
                        β5      1.7012   1.9036   1.7701     1.7558   1.9111   1.8052
                        β6      0.0128   0.0085   0.0137     0.0074   0.0072   0.0136
                        β7      0.0055   0.0057   0.0101     0.0052   0.0066   0.0082
                        β8      0.0067   0.0009   0.0002     0.0051   0.0011  -0.0021


Table 1 Multiple regression simulated data example 1. MMADs and estimated parameters for
Simulation 1 under the SEP and ALD assumption for the quantile error term.


4 Conclusion 


We show how to implement Bayesian quantile regression when the SEP distribution
is considered. Linear and Additive Models (AM) with penalized splines are
used with Lasso priors to account for the problem of shrinking parameters. Empirical
analysis highlights how SEP quantile regression better captures the behaviour
of the data when outliers or heavy tails are present.
Bayesian Factor-Augmented Dynamic Quantile
Vector Autoregression


Bayesian estimation of dynamic factor autoregressive
quantile models


Mauro Bernardi


Abstract This paper introduces a novel Bayesian model to estimate multi-quantiles
in a dynamic framework. The main innovation relies on the assumption that the
$\tau$-th level quantile of a vector of response variables depends on macroeconomic
variables as well as on latent factors having their own stochastic dynamics. The
proposed framework can be conveniently thought of as a factor-augmented vector
autoregressive extension of traditional univariate quantile models. We develop sparse
Bayesian methods that rely on state space representation and data augmentation
approaches that efficiently deal with the estimation of model parameters and the signal
extraction from latent variables.

Abstract This work introduces a new method for the estimation of multiple dynamic
quantiles. The innovation consists in assuming that the $\tau$-level quantile of a
vector of response variables depends on macroeconomic factors and on latent factors
having their own dynamics. The proposed model can be conveniently thought of as the
extension of traditional univariate quantile models to a vector autoregressive factor
model augmented with latent factors. Parameter estimation and the extraction of the
latent signal are carried out by means of a Gibbs sampler algorithm with a prior
distribution that induces sparse estimation of the parameters.


Key words: Quantile vector autoregression, Bayesian inference, Asymmetric Laplace,
factor models, sparse estimation.


1 Introduction


Quantile regression models have become increasingly popular because of
their attractive feature of modelling the quantile of a response variable as
a function of some covariates. Indeed, quantile models provide a more complete
picture of the conditional distribution of the response variable than the traditional
regression approach, without relying on strong assumptions about the form of the
error term. However, despite their evident power, quantile methods have
been mostly confined to modelling univariate response variables, see, e.g., Koenker
(2005). In this paper we extend univariate quantile regression models to deal with
multivariate response variables. Specifically, we model the marginal quantiles of a
multivariate random variable as a function of macroeconomic variables and of latent
factors having their own stochastic dynamics. Dynamic latent quantiles have been
introduced by De Rossi and Harvey (2009) and extended to the bivariate Bayesian
framework by Bernardi et al. (2015). We develop Bayesian methods that rely on
state space representation and data augmentation approaches that efficiently deal
with sparse estimation of model parameters and the signal extraction from latent
variables. A multivariate Asymmetric Laplace distribution is imposed on the error
term of the measurement equation in order to model the $\tau$-th level quantile, with
$\tau \in (0,1)$, of each marginal random variable. When dealing with multivariate
latent models, the curse of dimensionality hinders any parametric inferential
procedure. To overcome this problem we rely on sparse methods and in particular on the
spike-and-slab (Mitchell and Beauchamp 1988 and George and McCulloch 1993) Least
Absolute Shrinkage and Selection Operator (LASSO) prior of Tibshirani (1996).

The remainder of the paper is organised as follows. Section 2 introduces the 
multivariate Asymmetric Laplace distribution and its main properties. Section 3 in- 
troduces the dynamic factor-augmented quantile model and Section 4 deals with 
Bayesian inference and signal extraction. 


2 Multivariate Asymmetric Laplace distribution and quantiles 


In this Section we first introduce the multivariate Asymmetric Laplace (AL)
distribution and its stochastic representation, which will be useful to develop the
data-augmentation Gibbs sampler scheme. Then we prove that the multivariate AL
distribution characterises the univariate marginal quantiles.


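Definition 1. Consider a $p$-dimensional random vector
$Y = (Y_1, Y_2, \dots, Y_p) \in \mathbb{R}^p$ from a multivariate Asymmetric Laplace
(AL) distribution (see Kotz et al. 2001). The density of
$Y \sim AL_p(\vartheta, \xi, \Sigma)$ is given by

$$
f_Y(y \mid \vartheta, \xi, \Sigma) =
\frac{2\exp\left\{\xi'\Psi^{-1}D^{-1}(y - \vartheta)\right\}}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}
\left(\frac{\delta(y, \vartheta, \Sigma)}{2 + \xi'\Psi^{-1}\xi}\right)^{\nu/2}
K_\nu\left(\sqrt{\left(2 + \xi'\Psi^{-1}\xi\right)\delta(y, \vartheta, \Sigma)}\right)
\tag{1}
$$

where $\vartheta \in \mathbb{R}^p$ and $\xi \in \mathbb{R}^p$ are $p$-dimensional
vectors of location and shape parameters, respectively. Moreover,
$D = \mathrm{diag}\{\sigma_1, \sigma_2, \dots, \sigma_p\}$ with $\sigma_j > 0$, for
$j = 1,2,\dots,p$, and $\Psi$ is a correlation matrix, such that $\Sigma = D\Psi D$,
$\nu = (2-p)/2$, $\delta(y, \vartheta, \Sigma) = (y - \vartheta)'\Sigma^{-1}(y - \vartheta)$
is the squared Mahalanobis distance between $y$ and $\vartheta$, and $K_\nu(\cdot)$ is
the modified Bessel function of the third kind with index parameter $\nu$.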


The multivariate AL distribution can be represented as a Gaussian location-scale 
mixture with the Exponential distribution acting as mixing random variable. 


Proposition 1. Let $Y \sim AL_p(\vartheta, \xi, \Sigma)$ as introduced in Definition 1,
then

$$Y = \vartheta + D\xi W + \sqrt{W}\,\Sigma^{1/2} Z, \tag{2}$$

where $Z \sim N_p(0, I_p)$, $W \sim \mathrm{Exp}(1)$ with $Z_j \perp W$ for
$j = 1,2,\dots,p$, see, e.g., Kotz and Nadarajah (2004). It follows from equation (2)
that $Y \mid W = w \sim N_p(\vartheta + D\xi w, \Sigma w)$ and that the unconditional
distribution of $Y$ is given by equation (1), see Kotz et al. (2001).


The following remark instead characterises the behaviour of the marginal
distributions of each component $Y_j$, for $j = 1,2,\dots,p$.


Remark 1. Let $Y \sim AL_p(\vartheta, \xi, \Sigma)$ as introduced in Definition 1, then

$$Y_j \sim AL_1(\vartheta_j, \xi_j, \sigma_j^2), \tag{3}$$

where $\vartheta_j \in \mathbb{R}$ is the $j$-th element of $\vartheta$,
$\xi_j \in \mathbb{R}$ is the $j$-th element of $\xi$, and $\sigma_j^2 \in \mathbb{R}^+$
is the $j$-th element of the diagonal of the scale matrix $D$, see Kotz et al. (2001).


The next proposition characterises the AL distribution as the natural candidate for 
modelling the innovation term in multivariate quantile models. 


Proposition 2. Let $Z = (Z_1, Z_2, \dots, Z_p) \in \mathbb{R}^p$ with
$Z \sim \mathcal{D}$, where $\mathcal{D}$ is an unknown probability density function.
If we assume the AL as misspecified density for $Z$, i.e.,
$Z \sim AL_p(\vartheta, \xi, \Sigma)$, then

$$P(Z_j < \vartheta_j) = \tau, \tag{4}$$

if and only if

$$\xi_j = \frac{1 - 2\tau}{\tau(1 - \tau)}, \tag{5}$$
$$\sigma_j^2 = \frac{2\delta_j}{\tau(1 - \tau)}, \tag{6}$$

for $j = 1,2,\dots,p$, where $\sigma_j^2$ is the $j$-th diagonal element of the matrix
$D$, $\delta_j \in \mathbb{R}^+$ and $\tau \in (0,1)$ is the quantile confidence level.

Proof. Following Kotz et al. (2001) the marginal distribution of $Z_j$, for
$j = 1,2,\dots,p$, is Asymmetric Laplace, i.e.,
$Z_j \sim AL_1(\vartheta_j, \xi_j, \sigma_j^2)$, and imposing the conditions (5)-(6)
the result follows immediately, see, e.g., Yu and Moyeed (2001).
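
The link between the Gaussian mixture representation and the quantile restriction
can be illustrated by simulation in the univariate case. The Python sketch below uses
the scaling that is standard in the univariate Bayesian quantile regression
literature: an error $\theta W + \psi\sqrt{W}Z$ with $W \sim \mathrm{Exp}(1)$,
$Z \sim N(0,1)$, $\theta = (1-2\tau)/(\tau(1-\tau))$ and $\psi^2 = 2/(\tau(1-\tau))$,
with unit scale and zero location. The symbols $\theta$ and $\psi$ and the function
name are assumptions of this example and need not coincide with the scaling of
$(\xi, \sigma)$ adopted in this paper.

```python
import random


def draw_al_error(tau, rng):
    """Draw from a unit-scale Asymmetric Laplace error via its Gaussian mixture form."""
    theta = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))     # asymmetry (drift) term
    psi = (2.0 / (tau * (1.0 - tau))) ** 0.5            # scale of the Gaussian part
    w = rng.expovariate(1.0)                            # exponential mixing variable
    z = rng.gauss(0.0, 1.0)                             # standard Gaussian
    return theta * w + psi * (w ** 0.5) * z


rng = random.Random(42)
shares = {}
for tau in (0.1, 0.5, 0.9):
    draws = [draw_al_error(tau, rng) for _ in range(100_000)]
    # Empirical probability of a non-positive error; should be close to tau
    shares[tau] = sum(d <= 0.0 for d in draws) / len(draws)
    print(tau, round(shares[tau], 3))
```

Under these restrictions zero is the $\tau$-th quantile of the error, so the location
of the distribution can be read as the $\tau$-level quantile of the response.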
3 Dynamic latent factor-augmented quantile model


In this section, we introduce the dynamic latent factor-augmented quantile model.
Let $y_t = (y_{1,t}, y_{2,t}, \dots, y_{d,t})' \in \mathbb{R}^d$ and
$x_t = (x_{1,t}, x_{2,t}, \dots, x_{p,t})' \in \mathbb{R}^p$ be random vectors. We
assume that $(y_t', x_t')'$ is a linear function of latent factors

$$
\begin{bmatrix} y_t \\ x_t \end{bmatrix} =
\begin{bmatrix} A & B \\ 0_{(p \times s)} & I_p \end{bmatrix}
\begin{bmatrix} \chi_t \\ x_t \end{bmatrix} +
\begin{bmatrix} I_d \\ 0_{(p \times d)} \end{bmatrix} \epsilon_t,
\qquad t = 1,2,\dots,T, \tag{7}
$$

where $A$ is a $(d \times s)$ matrix of loadings of the stochastic latent factors
$\chi_t = (\chi_{1,t}, \chi_{2,t}, \dots, \chi_{s,t})'$, $B$ is the $(d \times p)$
matrix of loadings of the observed factors $x_t$, $I_d$ and $I_p$ denote identity
matrices of dimension $d$ and $p$, respectively, and $0_{(p \times s)}$ and
$0_{(p \times d)}$ denote zero matrices of dimension $p \times s$ and $p \times d$,
respectively. The stochastic term
$\epsilon_t = (\epsilon_{1,t}, \epsilon_{2,t}, \dots, \epsilon_{d,t})'$ follows a
multivariate Asymmetric Laplace distribution defined in equation (1), i.e.,
$\epsilon_t \sim AL_d(\vartheta, \xi, \Sigma)$. The state space formulation
is completed by specifying a dynamic evolution for the latent and observed factors
$\chi_t$ and $x_t$

$$
\begin{bmatrix} \chi_{t+1} \\ x_{t+1} \end{bmatrix} =
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} +
\begin{bmatrix} \Phi_1 & 0_{(s \times p)} \\ 0_{(p \times s)} & \Phi_2 \end{bmatrix}
\begin{bmatrix} \chi_t \\ x_t \end{bmatrix} + \eta_{t+1},
\qquad t = 1,2,\dots,T-1, \tag{8}
$$

where $\eta_t \sim N_{s+p}(0_{s+p}, \Omega_t)$ with

$$
\Omega_t = \begin{bmatrix} \Omega_t^{1,1} & \Omega_t^{1,2} \\ \Omega_t^{2,1} & \Omega_t^{2,2} \end{bmatrix} \tag{9}
$$

a positive definite matrix of order $s + p$ with
$\Omega_t^{1,1} \in \mathcal{M}^{(s,s)}$, $\Omega_t^{2,2} \in \mathcal{M}^{(p,p)}$,
$\Omega_t^{1,2} \in \mathcal{M}^{(s,p)}$, $\Omega_t^{2,1} = (\Omega_t^{1,2})'$, and
$\Phi_1$ and $\Phi_2$ are $(s \times s)$ and $(p \times p)$ transition matrices and
$\mu = (\mu_1', \mu_2')' \in \mathbb{R}^{s+p}$ with $\mu_1 \in \mathbb{R}^s$ and
$\mu_2 \in \mathbb{R}^p$. Here $\mathcal{M}^{(p,q)}$ denotes the space
of matrices of dimension $p \times q$. The transition equation (8) specifies first
order vector autoregressive processes (VAR) for both the latent and observed factors.
Indeed, alternative more flexible autoregressive specifications can be imposed by
exploiting the companion form representation of VAR models, see, e.g., Harvey (1989).
Furthermore, without loss of generality, we assume the VAR dynamic for $x_t$ to be
stationary, i.e., all the eigenvalues of the matrix $\Phi_2$ lie inside the unit
circle.

Concerning the specification of the initial states, we assume
$\chi_1 \sim N(\chi_{1|0}, P_{1|0})$, where the variance-covariance matrix $P_{1|0}$
can be diffuse to handle the presence of non-stationary elements of the latent states
$\chi_t$. Moreover, given the imposed stationary dynamic evolution of the observed
states $x_t$, we assume $x_1 \sim N(x_{1|0}, V_{1|0})$, where
$x_{1|0} = (I_p - \Phi_2)^{-1}\mu_2$ and
$\mathrm{vec}(V_{1|0}) = (I_{p^2} - \Phi_2 \otimes \Phi_2)^{-1}\,\mathrm{vec}(\bar\Omega^{2,2})$,
and $\bar\Omega^{2,2}$ is the square matrix of dimension $p \times p$ given by the
long-run matrix $\bar\Omega^{2,2} = \lim_{t \to \infty} \Omega_t^{2,2}$. We name the
model defined in equations (7)-(8) the Factor-Augmented Quantile Vector Autoregression
(FAQVAR) model.


4 Bayesian inference for the FAQVAR 


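The theory underlying the signal extraction and the Bayesian posterior computation
and simulation of the quantiles can be stated for a generic model in state space form
(see, e.g., Harvey 1989 and Durbin and Koopman 2012), where, without loss of
generality, we assume the scale parameter of the AL distribution depends on time.
Signal extraction and posterior mode computation of the latent generalised quantile
are based on the Kalman filter (Kalman and Bucy 1961) and the associated smoother
algorithm, see, e.g., De Jong and Shephard (1995) and Durbin and Koopman (2002).
Let

$$
y_t^+ = \begin{bmatrix} y_t \\ x_t \end{bmatrix}, \qquad
\chi_t^+ = \begin{bmatrix} \chi_t \\ x_t \end{bmatrix}, \tag{10}
$$

then

$$y_t^+ = Z\chi_t^+ + H\epsilon_t, \tag{11}$$
$$\chi_t^+ = \mu + T\chi_{t-1}^+ + \eta_t, \qquad t = 2,3,\dots,T, \tag{12}$$
$$\chi_1^+ \sim N\left(\chi_{1|0}^+, P_{1|0}^+\right), \tag{13}$$

where the selection matrix $H$ and the measurement and transition matrices $(Z, T)$
are defined as

$$
H = \begin{bmatrix} I_d \\ 0_{(p \times d)} \end{bmatrix}, \qquad
Z = \begin{bmatrix} A & B \\ 0_{(p \times s)} & I_p \end{bmatrix}, \qquad
T = \begin{bmatrix} \Phi_1 & 0_{(s \times p)} \\ 0_{(p \times s)} & \Phi_2 \end{bmatrix}, \tag{14}
$$

and $\chi_{1|0}^+ = (\chi_{1|0}', x_{1|0}')'$,
$P_{1|0}^+ = \begin{bmatrix} P_{1|0} & 0_{(s \times p)} \\ 0_{(p \times s)} & V_{1|0} \end{bmatrix}$,
where $A$, $B$, $\Phi_1$, $\Phi_2$ and $\chi_{1|0}$, $x_{1|0}$, $P_{1|0}$ and
$V_{1|0}$ have been defined in the previous section.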
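
Since signal extraction relies on the Kalman filter and the scale of the measurement
error is allowed to depend on time, a scalar sketch of the filtering recursions may
help fix ideas. The Python code below is an illustration under simplifying
assumptions, not the algorithm of this paper: one observed series, one latent state,
Gaussian measurement errors whose variance is `omegas[t] * sigma2` with the weights
`omegas` treated as known, and hypothetical argument names.

```python
def kalman_filter(ys, omegas, a, z, sigma2, mu, phi, q, x10, p10):
    """Scalar Kalman filter for the model
         y_t       = a * omega_t + z * chi_t + eps_t,  eps_t ~ N(0, omega_t * sigma2)
         chi_{t+1} = mu + phi * chi_t + eta_t,         eta_t ~ N(0, q)
       with chi_1 ~ N(x10, p10). Returns the filtered (mean, variance) pairs."""
    x_pred, p_pred = x10, p10
    filtered = []
    for y, w in zip(ys, omegas):
        # Measurement update
        v = y - a * w - z * x_pred           # one-step-ahead prediction error
        f = z * z * p_pred + w * sigma2      # its variance (time-varying through w)
        k = p_pred * z / f                   # Kalman gain
        x_filt = x_pred + k * v
        p_filt = p_pred * (1.0 - k * z)
        filtered.append((x_filt, p_filt))
        # Time update (prediction for the next period)
        x_pred = mu + phi * x_filt
        p_pred = phi * phi * p_filt + q
    return filtered


# Toy check: constant state, unit noise, prior N(0, 1) and observations 1 and 2.
filtered = kalman_filter([1.0, 2.0], [1.0, 1.0], a=0.0, z=1.0, sigma2=1.0,
                         mu=0.0, phi=1.0, q=0.0, x10=0.0, p10=1.0)
last_mean, last_var = filtered[-1]
print(round(last_mean, 6), round(last_var, 6))   # approximately 1.0 and 0.333333
```

In the toy check the filter reproduces the conjugate Gaussian posterior (mean
$(0+1+2)/3$ and variance $1/3$); a very large weight makes the corresponding
observation almost uninformative.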

The linear state space model introduced in equations (11)-(12) for modelling
time-varying conditional quantiles is non-Gaussian because of the assumption
made on the measurement innovation terms. In those circumstances, optimal
filtering techniques used to analytically marginalise out the latent states based on
the Kalman filter recursions cannot be applied (see Durbin and Koopman 2012).
However, exploiting the stochastic representation of the AL distribution as a
location-scale continuous mixture of Gaussians in Proposition 1, the non-Gaussian
state space model defined in equations (11)-(12) admits a conditionally Gaussian
and linear state space representation. More specifically, equations (11)-(12)
become:

$$y_t^+ = a\,\omega_t + Z\chi_t^+ + H\epsilon_t, \tag{15}$$
$$\chi_t^+ = \mu + T\chi_{t-1}^+ + \eta_t, \qquad t = 2,3,\dots,T, \tag{16}$$
$$\chi_1^+ \sim N\left(\chi_{1|0}^+, P_{1|0}^+\right), \tag{17}$$

where $a = \begin{bmatrix} D\xi \\ 0_{(p \times 1)} \end{bmatrix}$, with
$D = \mathrm{diag}\{\sigma_{11}, \sigma_{22}, \dots, \sigma_{dd}\}$,
$\epsilon_t \sim N_d(0_d, \omega_t\Sigma)$, $t = 1,2,\dots,T$,
are independent Gaussian innovations, $\omega_t \sim \mathrm{Exp}(1)$ independent of
$\epsilon_t$, for $t = 1,2,\dots,T$
and X = DWD, with D = VSD and ô = qe: We assume the following prior dis- 
tributions for the parameters £ = (A,B, 0j, j = 1,2,..., p, P, U, 1, B2, Q) 
A ~ Naxs) (H221 Q Z1) (18) 
B ~ Naxs (up: 289 Zp) (19) 
d 
ajy ~ [T'S (20,60), j=1,2,....p (20) 
j= 
= 2 2 
wij CITI {(1-2)N(0,v0) 11,1) (Wi) +N (0,77) 11,1) (Wi) 
i,j=1,2,....d 
(21) 
u ~ N (uQ, Ep) Pı ~N (ug, 29 ) (22) 
a~ N (u9 29) Imy (2) 2, ~ IW (co, Q11), 23) 


which are Normal, Inverse Gamma and Inverse Wishart respectively, with densities 


(x) cx exp { 2} (24) 
a (M) = [Col M (cot) exp {—tr (CoM~')} 


and Ojj = {0}.j.j= 1:21.04} and O; j = {oji j= 1,2,...,d,i< i}. where 
$\sigma_{i,j}$ denotes the $(i,j)$-th entry of the matrix $\Sigma$. Standard Gaussian
priors are imposed on the loading factors $(A, B)$, even if a shrinkage Lasso prior
could be used instead. Concerning the prior specification of the variance-covariance
matrix $\Sigma$, we follow the same approach of Wang (2015), which extends the
spike-and-slab approach of Mitchell and Beauchamp (1988) and George and McCulloch
(1993) to positive-definite matrices. This approach has recently received much
attention as a viable alternative to the Lasso prior to introduce sparsity in large
dimensional regression models as well as to model variance-covariance matrices in
Gaussian graphical models. Specifically, we impose an Inverse Gamma prior on the main
diagonal elements of $\Sigma$ in equation (20) and a spike-and-slab prior for the
off-diagonal elements in equation (21). The values of $v_0$ and $v_1$ are set to be
small and large, respectively, and the term $C$ represents the normalising constant
that ensures that the density function $\pi(\Sigma)$ integrates to one over the space
of positive-definite matrices; it depends, among other quantities, on $\pi$, $v_0$
and $v_1$. The prior in equation (21) can be defined by introducing the binary latent
variables $Z = (z_{i,j})_{i<j} \in \mathcal{Z} = \{0,1\}^{(p+d)(p+d-1)/2}$ and
the corresponding hierarchical model

$$\pi(\Sigma \mid Z) = \prod_{i<j} N\left(\sigma_{i,j} \mid 0, v_{z_{i,j}}^2\right), \tag{25}$$
$$\pi(Z) = \prod_{i<j} \pi^{z_{i,j}}(1 - \pi)^{1 - z_{i,j}}, \tag{26}$$

where $v_{z_{i,j}} = v_0$ if $z_{i,j} = 0$ and $v_{z_{i,j}} = v_1$ if $z_{i,j} = 1$.
The rationale behind using $Z$ for structure learning is as follows. For an
appropriately chosen small value of $v_0$, the event $z_{i,j} = 0$ means
that $\sigma_{i,j}$ comes from the concentrated component $N(0, v_0^2)$, and so
$\sigma_{i,j}$ is likely to be close to zero and can reasonably be estimated as zero.
For an appropriately chosen large value of $v_1$, the event $z_{i,j} = 1$ means that
$\sigma_{i,j}$ comes from the diffuse component $N(0, v_1^2)$, and so $\sigma_{i,j}$
can be estimated to be substantially different from zero. Because zeros in $\Sigma$
determine missing edges in graphs, the latent binary variables $Z$ can
be viewed as edge-inclusion indicators. Given the data, the posterior distribution
of $Z$ provides information about graphical model structures. The next proposition
characterises the full conditional distribution of the parameters $(\Psi, \sigma)$.


Proposition 3. Give the latent indicators Z and the latent variables ©, the condi- 
tional posterior distribution of & = HXH” can be factorised as follows 


i, hee Ba 
x (¥.0| Y,W.Z) = 2l? expf -31r (S010) | 
1 2 = 
x exp{ -5r (PwD) 


xepf-sr(Wosg#!)} 


«Tool MM Tor] api 27) 


i<j 
where 0 = i é, W = diag{@,@,...,Or}, ¥ = HYH' and S = YW~'Y’, with 


Ý = [yı - Z% Y2- Zap... yr — ZxXr] (28) 


is the d x T matrix of observations y, — ZX, t = 1,2,...,T stacked by row. Then 
z (oz; | Y,W,Z,¥, Gz) œ N (ño, Čo) 1o, (0) (29) 


with [lg = To (b0 + 22) and Tg = x25, for j= 1,2,...,d and 
(| ¥,W,Z,¥j,0) = Nai (Ayy) 11) (Via) (30) 
with 
lela 1/gy—p-1/2 iv 
~ ! #-lw-lp- 1 p-1/2y-1p-,- I we 
Ly, = Ty; |812D 1% 1 Di tad Fh Di +b P La (31) 
t=1 


z-1 _ fg! DDL Hw! u—l py !/2 
ty, = Di Ba SAD +D; Pa An Pii Di 


Does data structure reflect monuments
structure? Symbolic data analysis on Florence
Brunelleschi Dome


La struttura dei dati riflette le caratteristiche strutturali dei 
monumenti? Analisi di dati simbolici relativi al 


monitoraggio della Cupola del Brunelleschi a Firenze 


Bruno Bertaccini, Giulia Biagi, Antonio Giusti and Laura Grassini


Abstract. 

This paper describes work in progress on the analysis of the behaviour of the
web cracks in Brunelleschi's Dome of Santa Maria del Fiore in Florence. The
web cracks in the Dome have always raised concern about the stability of the
monument. The mechanical and electronic instruments have generated more than 6
million measurements, and the analyses performed so far have shown a steady increase
in the size of the main cracks and, at the same time, a relationship with the
environmental variables. The paper monitors these (big) data over time using the
methods of Symbolic Data Analysis.

Abstract (translated from the Italian). This contribution presents ongoing work on the
analysis of the behaviour of the set of cracks in Brunelleschi's Dome of Santa Maria del
Fiore in Florence. The “web” of cracks in the dome has always raised concerns about the
stability of the monument. The mechanical and electronic instruments installed on the
cracks have generated more than 6 million measurements, and the analyses carried out so
far have shown a steady increase in the size of the main cracks and, at the same time, a
relationship with environmental variables. Using symbolic data analysis techniques, this
contribution aims to provide an analysis that monitors the available dataset over time.


Key words: Symbolic Data Analysis, Big data, Brunelleschi’s Dome. 


1. Introduction 


Built in Gothic style to the design of Arnolfo di Cambio and completed in 1436 with the
dome engineered by Filippo Brunelleschi, Santa Maria del Fiore needs no introduction,
except to note that by 1418 only the dome remained to be built.
Weighing 37,000 tons and built with more than 4,000,000 bricks, Brunelleschi's dome
was the greatest architectural feat in the Western world.

The first cracks in the dome appeared at the end of the 15th century, because “the
weight of the upper dome and of the lantern (at the top of the dome) exceeds the
resistance of the base of the monument”. In 1695, a first commission was established to
investigate the stability of the Dome, but nothing concrete has
been done so far. To date, Brunelleschi's dome is the only large Renaissance dome
that has not yet been protected by containment measures (rigid
structures). For this reason, the monitoring system installed in the Dome, with more
than 160 instruments (e.g., mechanical and electronic deformometers, thermometers,
piezometers), is currently one of the most accurate control systems installed on a
historical and architectural monument.

Cracks are now present in all eight webs, mainly in the fourth and sixth webs,
both opposite the nave. In web 4, the main crack currently shows an
average increase of 5.5 mm/century. Twenty-three deformometers (13 in web 4 and 10 in
web 6) have been monitoring webs 4 and 6 since 1988. With 4 measurements per
day, the system produces a large amount of data, which requires a multifaceted approach
to deal with long-term patterns, seasonal and climatic reactions, and the impact of other
temporary factors (for example, earthquakes).

These data have already been analysed by various researchers (see, for example, [1],
[2]). In particular, Bertaccini [2] tried in 2015 to explain the complex
interrelationships between deformometer measures through a SEM model, in order
to represent the so-called breathing mechanism of the Dome over time: the cracks
tend to expand and shrink cyclically according to seasonal and climatic factors, some in
a concordant way and others not.

In this paper, we present a descriptive analysis of this large amount of data,
using interval-valued variables, according to the Symbolic Data Analysis (SDA)
approach [3]. SDA has already been used in studies on the health
monitoring of civil engineering structures, with the general aim of providing tools
and instruments to monitor the behaviour of a structure and to detect or signal any
abnormal behaviour ([4], [5], [6]); a first study of the Brunelleschi's Dome data
was presented in [16].

The aims of this work are: 
1) to compute the variability of the measurements over time, in order to
discover any underlying trend;
2) to explore the relationships between the various cracks, as also discussed in
[2], relationships which resemble the breathing mechanism of the Dome
over time.
The common theoretical framework for both aims is the decomposition of the
covariance described in [7], [8], [9]. The covariance
between two interval-valued variables can be decomposed into the sum of a within
component, related to the size of the intervals, and a between component, which
is simply the covariance between the interval midpoints.

The data are the arithmetic means of the intra-day measurements of the 23
deformometers installed on webs 4 and 6. For each deformometer we defined an interval
variable reporting the 10th and 90th percentiles of the daily averages within each week,
from 1988 to July 2007. In this way, we drastically reduce the data size and allow for
days on which, because of temporary faults, fewer than 4 measurements are available.
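
The construction of the weekly interval variable can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function names, the toy readings and the linear-interpolation percentile rule are hypothetical, since the paper does not say which percentile definition was used.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def percentile(values, p):
    """Percentile with linear interpolation between order statistics (assumed rule)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def weekly_intervals(readings):
    """Map (day, value) readings to {(iso_year, iso_week): (p10, p90)}.

    Step 1: average whatever intra-day readings exist for each day
            (a day with fewer than 4 readings is still usable).
    Step 2: within each ISO week, take the 10th and 90th percentiles
            of the daily averages as the interval bounds.
    """
    by_day = defaultdict(list)
    for day, value in readings:
        by_day[day].append(value)

    by_week = defaultdict(list)
    for day, values in by_day.items():
        iso_year, iso_week, _ = day.isocalendar()
        by_week[(iso_year, iso_week)].append(mean(values))

    return {
        week: (percentile(daily, 10), percentile(daily, 90))
        for week, daily in sorted(by_week.items())
    }


# Hypothetical readings (mm) for one week starting Monday 11 January 1988:
# 4 readings per day, except one day with a single reading.
start = date(1988, 1, 11)
readings = []
for d in range(7):
    n_readings = 1 if d == 3 else 4
    for k in range(n_readings):
        readings.append((start + timedelta(days=d), 0.10 * d + 0.01 * k))

intervals = weekly_intervals(readings)
```

Each week is thus summarised by two numbers per deformometer, whatever the number of readings actually recorded on each day.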

The paper is structured as follows. Section 2 recalls some basic algebra for
interval-valued data. Section 3 describes the empirical analysis, and Section 4
contains some final remarks.


2. Arithmetic of interval symbolic data and statistical parameters 


Before analysing the statistical indices applied on symbolic interval data, it is useful 
to recall some algebra related to interval analysis [10]. 

Let us consider two intervals $X = [x_1, x_2]$ and $Y = [y_1, y_2]$, where $x_1, y_1$ and
$x_2, y_2$ are respectively the minimum and the maximum of each interval.


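The addition and subtraction of two intervals are, respectively:

$$[x_1, x_2] + [y_1, y_2] = [x_1 + y_1,\; x_2 + y_2]$$
$$[x_1, x_2] - [y_1, y_2] = [x_1 - y_2,\; x_2 - y_1]$$

Moreover, the linear combination of two intervals with coefficients $a$, $b$ is:

$$a[x_1, x_2] + b[y_1, y_2] =
\begin{cases}
[a x_1 + b y_1,\; a x_2 + b y_2] & \text{if } a > 0,\ b > 0 \\
[a x_1 + b y_2,\; a x_2 + b y_1] & \text{if } a > 0,\ b < 0 \\
[a x_2 + b y_1,\; a x_1 + b y_2] & \text{if } a < 0,\ b > 0
\end{cases} \tag{1}$$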
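
The interval operations above can be sketched in a few lines. The helper names are hypothetical; intervals are represented as `(min, max)` tuples, and the linear combination is built from a scaling step that swaps the endpoints when the coefficient is negative, so it covers every sign combination of a and b.

```python
def add(x, y):
    """[x1, x2] + [y1, y2] = [x1 + y1, x2 + y2]."""
    return (x[0] + y[0], x[1] + y[1])


def subtract(x, y):
    """[x1, x2] - [y1, y2] = [x1 - y2, x2 - y1]."""
    return (x[0] - y[1], x[1] - y[0])


def scale(a, x):
    """a * [x1, x2]; a negative coefficient swaps the endpoints."""
    lo, hi = a * x[0], a * x[1]
    return (lo, hi) if a >= 0 else (hi, lo)


def linear_combination(a, x, b, y):
    """a*X + b*Y for coefficients of any sign."""
    return add(scale(a, x), scale(b, y))


# Hypothetical intervals used only for illustration
X, Y = (1.0, 3.0), (2.0, 5.0)
total = add(X, Y)
difference = subtract(X, Y)
mixed = linear_combination(2, X, -1, Y)   # case a > 0, b < 0
```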


Let us now consider a set of observations on $X$ and $Y$ represented by $n$ intervals:

$$X_i = [x_{i1}, x_{i2}], \qquad Y_i = [y_{i1}, y_{i2}], \qquad i = 1, \dots, n.$$

The centre of each interval $i$ is the midpoint of the interval. For $X$:

$$\bar{x}_i = \frac{x_{i1} + x_{i2}}{2} \tag{2}$$

and the overall mean is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} \bar{x}_i = \frac{1}{2n}\sum_{i=1}^{n} (x_{i1} + x_{i2}) \tag{3}$$


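Several definitions of sample variance have been introduced in the SDA field. Billard
[7] suggests the following:

$$S_X^2 = \frac{1}{3n}\sum_{i=1}^{n}\left(x_{i1}^2 + x_{i1}x_{i2} + x_{i2}^2\right)
 - \frac{1}{4n^2}\left[\sum_{i=1}^{n}(x_{i1} + x_{i2})\right]^2 \tag{4}$$

It is derived in [11] under the assumption of a uniform distribution within each
interval. It can be shown that (4) can be rewritten as

$$nS_X^2 = \frac{1}{3}\sum_{i=1}^{n}\left[(x_{i1} - \bar{x}_i)^2 + (x_{i1} - \bar{x}_i)(x_{i2} - \bar{x}_i) + (x_{i2} - \bar{x}_i)^2\right]
 + \sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2 \tag{5}$$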


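Expression (5) shows that the total deviance $nS_X^2$ can be decomposed as the sum
of two components: (1) the internal variation of the data (SSW) and (2) the between
variation (SSB), expressed by the comparison between the interval means and the
overall mean:

$$SSW_X = \frac{1}{3}\sum_{i=1}^{n}\left[(x_{i1} - \bar{x}_i)^2 + (x_{i1} - \bar{x}_i)(x_{i2} - \bar{x}_i) + (x_{i2} - \bar{x}_i)^2\right]
 = \frac{1}{3}\sum_{i=1}^{n}(x_{i2} - \bar{x}_i)^2 \tag{6}$$

$$SSB_X = \sum_{i=1}^{n}\left(\frac{x_{i1} + x_{i2}}{2} - \bar{x}\right)^2 = \sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2 \tag{7}$$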
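
The decomposition of the total deviance into SSW and SSB can be checked numerically. The sketch below uses hypothetical intervals and a hypothetical function name; the within term is written as the sum of squared ranges divided by 12, which is algebraically the same as one third of the sum of squared distances between each upper bound and its midpoint.

```python
def symbolic_variance_parts(intervals):
    """Return (total, within, between) deviances for a list of (x1, x2) intervals."""
    n = len(intervals)
    mids = [(x1 + x2) / 2 for x1, x2 in intervals]
    grand_mean = sum(mids) / n
    # Within part (SSW): uniform spread inside each interval, (x2 - x1)^2 / 12.
    ssw = sum((x2 - x1) ** 2 for x1, x2 in intervals) / 12
    # Between part (SSB): classical deviance of the midpoints.
    ssb = sum((m - grand_mean) ** 2 for m in mids)
    # Total deviance n * S^2, computed directly from expression (4).
    total = (
        sum(x1 * x1 + x1 * x2 + x2 * x2 for x1, x2 in intervals) / 3
        - sum(x1 + x2 for x1, x2 in intervals) ** 2 / (4 * n)
    )
    return total, ssw, ssb


# Hypothetical intervals used only for illustration
total, ssw, ssb = symbolic_variance_parts([(1.0, 3.0), (2.0, 6.0), (4.0, 5.0)])
within_share = ssw / total
```

For these three intervals the within component accounts for one third of the total deviance; the within shares reported later for webs 4 and 6 are ratios of this kind.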


Finally, the sample symbolic covariance between two interval variables can also
be expressed as the sum of within and between components [7, 9]:

$$CODEVT = n\,Cov(X, Y) = \sum_{i=1}^{n}\frac{(x_{i2} - x_{i1})(y_{i2} - y_{i1})}{12}
 + \sum_{i=1}^{n}(\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y}) \tag{8}$$

where the left and right terms in the sum are, respectively, the within un-centred
codeviance on the ranges (CODEVW) and the between sample codeviance
(CODEVB) between $X$ and $Y$. These expressions are derived under the assumption
of uniform distributions within the intervals. From (8), we see that the within
component (CODEVW) is not a true covariance of the ranges, because it is not
computed on ranges centred on their mean. CODEVW is never negative and its
magnitude depends on the intervals' ranges. Therefore, the within codeviance
incorporates information about the size of the individuals' rectangles. The between
component is the codeviance of the centres, as in the classical framework.

The entries of CODEVT are never smaller than those of the classical codeviance
matrix based on midpoints, which coincides with the between part and is used in the
centres method. It follows that CODEVT cannot contain more negative terms than
CODEVB, and that CODEVB may be negative while CODEVT is positive.
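
The sign effect described above can be reproduced with a small numerical sketch. The data and the function name are hypothetical: the midpoints of the two variables move in opposite directions, but the intervals are wide enough for the within term to dominate.

```python
def symbolic_codeviance_parts(xs, ys):
    """Return (CODEVT, CODEVW, CODEVB) for paired lists of (min, max) intervals."""
    n = len(xs)
    mx = [(a + b) / 2 for a, b in xs]
    my = [(c + d) / 2 for c, d in ys]
    x_bar, y_bar = sum(mx) / n, sum(my) / n
    # Within part: un-centred products of the ranges, divided by 12.
    codev_w = sum((b - a) * (d - c) for (a, b), (c, d) in zip(xs, ys)) / 12
    # Between part: classical codeviance of the midpoints.
    codev_b = sum((u - x_bar) * (v - y_bar) for u, v in zip(mx, my))
    return codev_w + codev_b, codev_w, codev_b


# Hypothetical data: X midpoints increase while Y midpoints decrease slightly
xs = [(0.0, 4.0), (1.0, 5.0), (2.0, 6.0)]
ys = [(0.2, 4.2), (0.1, 4.1), (0.0, 4.0)]
codev_t, codev_w, codev_b = symbolic_codeviance_parts(xs, ys)
```

Here the between codeviance is negative while the total is positive, which is why a between correlation matrix is a useful complement to the symbolic one.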

From (8), we can derive the symbolic correlation between two interval symbolic 
variables. A crucial advantage of the symbolic covariance between two intervals is 
that it fully utilizes all information in the data. 


3. Data description and analysis 


The data analysed here are provided by part of the electronic
monitoring system installed by ISMES in 1987. That system consists of, among other
instruments, 66 deformometers, 56 thermometers, and two piezometers. It has
recorded data every 6 hours since January 8, 1988. The data used in this work end
on July 31, 2007 [12, 1], which means 35,100 measures per deformometer, for a total of
approximately 6 million measurements.

Each deformometer records the deformation of the building: since 1988, it has 
been recording the growth (with positive values) and the reduction (with negative 
values) of a crack. The minimum and maximum values over a given time period 
define an interval valued variable. The centre of that variable is the average growth 
or decrease of the crack with respect to the 1988 status, under the hypothesis of 
uniform distribution within the interval. 

Available data are affected by the presence of missing data and outliers [2], 
mostly due to storms and blackouts that often cause calibration problems. In order to 
analyse a complete data matrix, we used the cleaned data provided and used in [2]. 

The location of the deformometers in the structure determines their behaviour:
those located near the tambour are more stable, exhibiting less variability and
more similar time patterns.

On average, web 6 exhibits a larger variance than web 4, although the
average contribution of the within component is much lower (0.37% for web
6, 4.94% for web 4).

However, the behaviour of the cracks changes over time, so the dynamic nature of
the data must be taken into account. We have therefore computed a moving
variance, using a moving window of 52 weeks (one year). The idea is to explore
how the variability of the cracks changes over time, taking the year as the basic time
window.
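
A moving symbolic variance of this kind can be sketched as follows. The function names and the synthetic weekly intervals are hypothetical, and the window advances one week at a time, which is an assumption, since the paper does not state the step.

```python
def symbolic_variance(intervals):
    """Symbolic variance S^2 and the share of it due to the between component."""
    n = len(intervals)
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    grand_mean = sum(mids) / n
    within = sum((hi - lo) ** 2 for lo, hi in intervals) / (12 * n)
    between = sum((m - grand_mean) ** 2 for m in mids) / n
    total = within + between
    share = between / total if total > 0 else 0.0
    return total, share


def moving_symbolic_variance(intervals, window=52):
    """Apply symbolic_variance to every run of `window` consecutive weeks."""
    return [
        symbolic_variance(intervals[start:start + window])
        for start in range(len(intervals) - window + 1)
    ]


# Hypothetical series: 60 weekly intervals of constant width on a linear trend
weekly = [(0.01 * w, 0.01 * w + 0.2) for w in range(60)]
moving = moving_symbolic_variance(weekly, window=52)
```

Plotting the total and the between part of each window side by side gives curves of the kind compared in Figure 1.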

The use of adaptive methods and the choice of an annual period are not new in
structural health monitoring, and are also adopted with multivariate methods (for
example, moving and recursive principal component analysis [13], [14], [15]).
Moreover, moving or rolling statistical indices are commonly computed to assess the
constancy of model parameters. In this case, we wish to assess the importance of the
between component of the variance.

Figure 1 shows the time patterns of the moving symbolic variance and of the
moving between variance for some measurements. The relevance of the within
component differs across cracks: it is higher for some (see graphs on
the left) than for others (see graphs on the right), depending on the placement of the
cracks.
Fig. 1: Moving symbolic variance (black) and moving variance between (red) for some deformometers’ 
measurements of webs 4 and 6 


The relationships among the deformometers' measures are explored through
the symbolic correlation matrix (see Figure 2a), which reveals strong
positive relationships, and not only within the same
web. Only deformometer D4.9 shows a negative correlation, but this is mainly due
to its abnormal behaviour in the early stages (Fig. 3). The variables are ordered so as to
highlight similar behaviours. The determinant of the correlation matrix is almost
zero, confirming the strong linear relationships. However, there are 10 cases with a
positive symbolic correlation and a negative correlation between midpoints,
although the size of the correlation is very low. This prevalence of positive
correlations may be determined by the positive contribution of the within
covariance. For this reason, we also provide the between correlation matrix (see
Figure 2b).

Both matrices show that the breathing mechanism of the Dome is mainly
(almost entirely) characterized by a harmonious movement of the cracks, as the
positive correlations are clearly dominant and stronger than the negative
correlations.

Excluding deformometer D4.9, negative correlations occur specifically for the 
cracks D4.1, D4.2, D6.1, D6.7, which tend to counteract the behaviour of most 
Fig. 2a: Symbolic correlation matrix (order: first principal component)
Fig. 2b: Between correlation matrix (order: first principal component)

Fig. 3: Time plot of the min and max values for D4.9


4. Final remarks 


The findings are consistent with those of [2]: the structure of the Dome may be
likened to what in physics is defined as a “closed system”, in which the
structural constraints define the relationship of forces between the various cracks,
which in turn are subject to the action of meteorological and seismic variables. To
date, the literature lacks a study that involves all the variables simultaneously recorded
by the monitoring system. A joint analysis of all available data entails
a serious computational burden, as well as estimation and interpretation problems, due to
the complexity of the relationships between the variables and the number of measures
acquired per day.

With a reduced computational effort, and working on a dataset reduced to less
than 15% of the one based on the daily averages, the Symbolic Data approach
preserves the same structure in the data. We are confident that the methods
of Symbolic Data Analysis will make it possible to resolve the critical aspects that
have so far prevented a comprehensive description of the mechanisms of the
static-structural evolution of the monument, and to simulate its possible “reactions” to
environmental changes of an exceptional nature.
A latent Markov model approach for measuring
national gender inequality


Modello latent markov per la misura delle disequita di 
genere nazionali 


Gaia Bertarelli and Franca Crippa and Fulvia Mecatti 


Abstract Gender inequality, both in space and in time, is a latent trait: it can only
be measured indirectly, through a collection of purposively selected observable variables
and indicators. Although composite indicators are normally used by social scientists,
they are known to have case-specific technical limitations when measuring the gender gap.
In this paper we propose an innovative approach based on a multivariate
Latent Markov model (LMM) for the analysis of gender inequalities as
measured by the aforementioned indicators.

Abstract (translated from the Italian) Gender statistics is concerned with developing
methodologies able to capture disparities and differences in the situation of women and
men in all aspects of life. In recent years the availability of data for gender analysis
has increased, as more and more countries are adopting specific surveys. The most
common tools in the gender statistics literature are composite indicators, which
however have well-known methodological limitations. We propose an innovative
approach to gender statistics based on a multivariate latent Markov model
for the analysis of inequalities.


Key words: Gender Statistics, Clustering, GID-Database OECD, latent variable. 


1 Background and Introduction 


Gender equality is a recognized goal of modern democracies and an objective for
global civilization, since policies and actions capable of reducing gender
disparities would benefit society as a whole, both women and men.
The availability of good-quality data for engendered statistical analysis at the
national level has increased since the 1990s. Gender statistics based on household
surveys and administrative records are becoming widely available. Gender inequality
is a latent trait: it can only be measured indirectly, through a collection of observable
variables and indicators purposively selected as micro-aspects contributing to
the latent macro-dimension. This is one of the main reasons for the popularity of
composite indicators as current gender statistics indicators, i.e. aggregations (usually
linear combinations) of a collection of simple indicators, each singled out for
assessing a specific micro-aspect of the latent gender dimension. Several world
rankings based upon national gender composite indicators are periodically released by
supranational agencies (see for instance [2] for a comparative review). Although
normally used by social scientists, such gender-gap measures are known to have
case-specific technical limitations [3], which often lead to internal inconsistency, since the
ranking of a single country can vary with the indicator considered. Moreover,
a significant part of the literature criticizes the use of composite indicators
on the grounds of trivial marginalization and arbitrariness [4]. In this paper we
propose an innovative approach to the measurement of gender inequality based on a
multivariate Latent Markov model (LMM).


2 Data 


We focus on two inequality indexes, the Gender Inequality Index (GII) and the
Global Gender Gap Index (GGGI). A main reason for selecting them is that they are
recent, so that the aforementioned technical issues still call for more advanced knowledge. The
GII, introduced by UNDP in 2010, measures gender inequalities in three aspects
of human development: reproductive health, empowerment and economic status.
The GGGI was first introduced by the World Economic Forum in 2006 as a framework
for capturing the magnitude of gender-based disparities and for tracking their
progress. Three basic concepts underlie the GGGI. First, the index focuses on
measuring gaps rather than levels. Second, it captures gaps in outcome variables rather
than in input variables. Third, it ranks countries according to gender gaps rather than
women's empowerment. It measures four aspects: economic participation and
opportunity, educational attainment, health and survival, and political empowerment.
Rankings based on these indicators differ from each other and are not
constant over time, as a consequence of different choices in both the selection of
measurable variables and the aggregation system. In this paper we consider a multivariate
model of latent Markov type, able to take as input both indexes as well as a set of
covariates. An improved gender inequality measure is expected as a result. A preliminary
univariate analysis is conducted for the period 2010-2016 to assess possible
measurement errors in the GGGI and the GII. After considering constitutional gender
equity (see http://constitutions.unwomen.org/en) and social structure
as covariates in the latent model component, we introduce time use in the
measurement part.


3 Model 


LMMs (see [1] for a general review) are a class of statistical models for longitudinal
data which assume the existence of a latent process that affects the distribution
of the response variables. The existence of two processes is assumed: an unobservable
finite-state first-order Markov chain $U_i^{(t)}$, $i = 1, \dots, n$ and $t = 1, \dots, T$, with
state space $\{1, \dots, m\}$, and an observed process $Y_i^{(t)}$, $i = 1, \dots, n$ and $t = 1, \dots, T$,
where $Y_i^{(t)}$ denotes the response variables for area $i$ at time $t$, and similarly for $U_i^{(t)}$.
We assume that the distribution of $Y_i^{(t)}$ depends only on $U_i^{(t)}$: the latent process
fully explains the observable behaviour of an item, together with possibly available
covariates. It is therefore important to distinguish between two components: the
measurement model, which concerns the conditional distribution of the response
variables given the latent process, and the latent model, which concerns the
distribution of this latent process.

The unknown vector of parameters $\phi$ in an LMM includes both the parameters
of the Markov chain, $\phi_{lat}$, and the vector of parameters of the state-dependent
distribution, $\phi_{obs}$. The measurement model involves $\phi_{obs}$ and can be written
as $Y_i^{(t)} \mid U_i^{(t)} \sim f(y, u, \phi_{obs})$. The latent model includes the parameters $\phi_{lat}$ of the
Markov chain, which are the elements of the transition probability matrix
$\Pi = \{\pi_{u|\bar{u}}\}$, with $u, \bar{u} = 1, \dots, m$, where
$\pi_{u|\bar{u}} = P(U_i^{(t)} = u \mid U_i^{(t-1)} = \bar{u})$ is the probability
that area $i$ visits state $u$ at time $t$ given that at time $t-1$ it was in state $\bar{u}$, and the
vector of initial probabilities $\pi = (\pi_1, \dots, \pi_u, \dots, \pi_m)$, where $\pi_u = P(U_i^{(1)} = u)$ is
the probability of being in state $u$ at the initial time, for $u = 1, \dots, m$. In this work we
consider homogeneous LMMs.
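
To make the two components concrete, the likelihood of one area's series under a homogeneous LMM can be computed with the standard forward recursion. The sketch below is an illustration under stated assumptions: a univariate Gaussian state-dependent distribution is assumed (the paper leaves f unspecified), and the function names and all parameter values are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def lmm_likelihood(ys, init, trans, means, sds):
    """Likelihood of one area's series under a homogeneous latent Markov model.

    init[u]     : probability of state u at the initial time
    trans[v][u] : probability of moving from state v to state u,
                  the same at every time point (homogeneous chain)
    means, sds  : parameters of the assumed Gaussian state-dependent density
    """
    m = len(init)
    # alpha[u]: joint density of the observations so far and current state u
    alpha = [init[u] * normal_pdf(ys[0], means[u], sds[u]) for u in range(m)]
    for y in ys[1:]:
        alpha = [
            sum(alpha[v] * trans[v][u] for v in range(m))
            * normal_pdf(y, means[u], sds[u])
            for u in range(m)
        ]
    return sum(alpha)


# Hypothetical two-state example (low gap / high gap) with made-up index values
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
series = [0.32, 0.35, 0.66]
likelihood = lmm_likelihood(series, init, trans, means=[0.3, 0.7], sds=[0.1, 0.1])
```

In this sketch the measurement model is the pair (`means`, `sds`) and the latent model is the pair (`init`, `trans`); estimation would maximise the product of such likelihoods over areas.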

LMMs can assess the presence of measurement errors, or account for unobserved
heterogeneity between areas when the covariates included in the measurement
model do not completely explain the heterogeneity in the response variables.
In LMMs the effect of the unobservable variable has its own dynamics. Moreover,
a latent clustering of the population of interest can be identified. Our proposal
adapts the LMM to the gender statistics framework by interpreting the
national gender gap as the latent status of interest and using the distributions of
the GGGI and the GII as response variables. This methodology integrates
into the same LMM both the selected composite indicators and a set of available
observable covariates of any, possibly mixed, nature. Our methodology organizes
countries into ordinal clusters representing the severity of the gap. The classification
takes into account the values of the considered covariates, and this
overcomes the so-called “world-at-two-speed” effect, i.e. gender inequalities due to
the denial of basic human rights (developing countries or countries in transition) or due
to uneven opportunities between men and women (developed countries with gender
equality stated by law) [2], which is especially evident in the GII's distribution.
However, looking at the temporal distributions, this gap seems to dwindle
over time. For this reason, a longitudinal analysis is appropriate. We conduct a
two-step analysis. First, we apply an LMM with only spatial and gender
constitutional equality covariates in the latent model, in order to identify clusters of
countries that are actually comparable under the “two-speed” effect mentioned above. Then
we apply an LMM within each cluster, considering social and economic covariates in
the measurement model, to detect the main differences and the variability within the same
group.


4 Expected Results 


We propose to integrate into the same LMM both the selected composite indicators
and a set of available observable covariates of any, possibly mixed, nature (categorical,
ordinal and quantitative), fully exploiting the multidimensional latent nature
of gender imbalance. The model would organize the countries into
an (optimal) number of ordered clusters. The classification takes into
account the values of the considered covariates, and this overcomes the so-called
“world-at-two-speed” effect, which is especially evident in the GII's distribution.
Moreover, the proposed methodology allows the forecasting of future
responses and path prediction.
Eurostat's methodological network: Skills mapping 
for a collaborative statistical office 


Agne Bikauskaite and Dario Buono 


Abstract Collaboration, interaction and exchange of knowledge among staff are important
components for developing and enriching scientific intelligence within a statistical office.
Eurostat's methodological network has been built as a skills mapping tool that aims to
identify in-house competencies for innovation and for the affordable diffusion of knowledge,
and to promote and modernise collaboration on methodological issues and processes within
the statistical office. We mainly focus on staff's knowledge and on their working and academic
experience in methodological areas, domains and tools in statistics and econometrics.
Quantitative network analysis metrics are used to measure the strength of existing
methodological competencies within Eurostat, to identify groups of people for collaboration
in providing results on specific tasks, and to characterise areas that are not fully integrated
into the methodological network. By combining network visualisation and quantitative
analysis, we can easily assess the competency level for each dimension of interest. Network
analysis helps us make decisions on improving staff communication and collaboration, by
building mechanisms for information flows and filling competency gaps. Representing the data
as a mathematical graph makes the general picture readily visible, conveys its structure, and
allows us to focus on persons, competencies and the relations between them. Modernising
ways of working leads to a more cost-effective use of existing resources.


Key words: complex network, data analysis, network visualization, bipartite graphs, 
network projection, ego network, network analysis 


1 Introduction 


Collaboration, interaction and exchange of knowledge among staff are important
components for developing and enriching scientific intelligence within a
statistical office, especially when this exchange happens across areas of interest to
both interacting sides. The methodological network has been built as a skills mapping
tool that aims to identify in-house competencies for innovation and for the affordable
diffusion of knowledge and information, and to promote and modernise
collaboration on methodological issues and processes within the statistical office. We
mainly focus on staff's knowledge and on their working and academic experience in
methodological areas, domains and tools in statistics and econometrics. This paper
provides a set of mathematical network analysis measures, from basic ones such as size
and degree to more complex ones such as the clustering coefficient and its correlation with
degree, which evaluate the structure of the methodological knowledge network and make
it easier to understand.


2 Methods 


Quantitative network metrics are used to measure the strength of existing
methodological competencies within the statistical office, to identify groups of people
for collaboration in providing results on specific tasks, and to characterise areas that are
not fully integrated into the methodological network. Network analysis helps us
make decisions on improving staff communication and collaboration,
by building mechanisms for information flows and filling competency gaps. By
combining network visualisation and quantitative analysis, we can easily assess
the competency level for each dimension of interest.


2.1 Bipartite graph


Network data consist of a set of elements with relations on those elements, and
may be represented as a graph. Our research subjects, individuals, form links which
characterise their competencies in statistics and econometrics. Formally we have a
graph $G = (V, E)$, where $G$ is a relational structure consisting of a set of vertices
$V$ and a set of edges $E$ [2]. We say that a graph is bipartite when the vertex set $V$ is
divided into two finite, disjoint sets $V_1$ and $V_2$, with $V_1 \cap V_2 = \emptyset$ [4]. When $V_1$ is
composed of the first-mode vertices and $V_2$ of the second-mode vertices, we have the
bipartite graph $G = (V_1, V_2, E)$, where ties connect elements of different modes only.


2.2 Network analysis 


In order to understand the organisational methodological network and its structure,
network analysis statistical models have been employed. The data are arranged as a
person-by-skill matrix $A$ of size $n_{V_1} \times n_{V_2}$, where the rows correspond to the
methodological network members interested in collaboration and knowledge exchange and the
columns to the dimensions of competencies:

$$a_{ij} = \begin{cases} 1, & \text{if person } i \text{ has a link to methodological skill } j; \\ 0, & \text{otherwise.} \end{cases}$$

The two most basic parameters of the graph are the number of vertices
$n = n_{V_1} + n_{V_2}$, where $n_{V_1} = |V_1|$ and $n_{V_2} = |V_2|$, and the number of edges
$m = |E|$ [3].

The degree of a vertex helps to identify the best-known competencies and to diagnose
critical areas within the methodological network. The average degrees of the vertex
sets $V_1$, corresponding to the survey respondents, and $V_2$, corresponding to the listed
methodological competencies, are commonly used to summarise how well connected
the network is. Each is defined as the ratio of the number of links in the network to the
number of nodes in the set [1]:

$$k_{V_k} = \frac{m}{n_{V_k}}, \quad \text{where } k = 1, 2.$$

The average degree of the overall network is obtained from the total numbers of
nodes and edges by the following equation [1]:

$$k = \frac{2m}{n_{V_1} + n_{V_2}}.$$
The density $\delta$ of the bipartite graph $G$ measures the average ratio of the actual
degree of the nodes in the network to the maximum possible degree, which corresponds to
the number of nodes in the set of the other mode:

$$\delta(G) = \frac{m}{n_{V_1} n_{V_2}}.$$

This index is equal to 1 for the fully connected network (i.e. every vertex of one mode
is linked to every vertex of the other mode) and takes the value 0 when the network is
fully disconnected (i.e. $G$ is composed entirely of isolates).
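
As an illustration of the descriptive measures defined in this section, the following minimal Python sketch computes them for a small person-by-skill matrix. The toy matrix and the function name `bipartite_summary` are assumptions made for illustration only; they are neither the survey data nor the authors' code.

```python
# Hypothetical toy incidence matrix: 4 persons (rows) x 3 skills (columns).
A = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
]


def bipartite_summary(A):
    """Return basic descriptive measures of a two-mode (bipartite) network."""
    n1 = len(A)                       # number of first-mode vertices (persons)
    n2 = len(A[0])                    # number of second-mode vertices (skills)
    m = sum(sum(row) for row in A)    # number of edges
    return {
        "n": n1 + n2,                         # total number of vertices
        "m": m,
        "avg_degree_persons": m / n1,         # average degree within the first set
        "avg_degree_skills": m / n2,          # average degree within the second set
        "avg_degree_all": 2 * m / (n1 + n2),  # average degree of the whole network
        "density": m / (n1 * n2),             # share of possible ties that exist
    }


summary = bipartite_summary(A)
print(summary)
```

The density uses the product of the two set sizes as denominator because, in a bipartite graph, ties can only link vertices of different modes.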
The clustering coefficient, which concerns link correlation, gives an idea of how
compact the network is. The clustering coefficient of a node $i$ is the number of
links observed between the nodes within its neighbourhood divided by the number of edges
that could possibly exist between these nodes [4]:

$$CC_i = \frac{q_{ijl}}{(k_j - \eta_{ijl}) + (k_l - \eta_{ijl}) + q_{ijl}},$$

where $j$ and $l$ are a pair of neighbours of node $i$, $k_j$ and $k_l$ are their degrees,
$q_{ijl}$ is the number of squares which include these three nodes, and
$\eta_{ijl} = 1 + q_{ijl} + \theta_{jl}$, with $\theta_{jl} = 1$ if neighbours $j$ and $l$
are connected with each other and 0 otherwise.
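
A minimal sketch of this pairwise coefficient, assuming the network is stored as a dictionary that maps each node to the set of its neighbours. It counts the squares passing through a node and two of its neighbours and divides by the number of squares that could exist given the neighbours' degrees. The toy edge list and the function name `pair_clustering` are hypothetical. In a bipartite graph two neighbours of the same node belong to the same mode, so the adjacency indicator is always 0; it is kept here for generality.

```python
# Hypothetical toy network: persons p1..p3, skills s1..s2.
edges = [("p1", "s1"), ("p1", "s2"), ("p2", "s1"), ("p2", "s2"), ("p3", "s1")]

# Build an undirected adjacency structure: node -> set of neighbours.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def pair_clustering(adj, i, j, l):
    """Square clustering of node i for the pair of its neighbours (j, l)."""
    # q: squares containing i, j and l, i.e. common neighbours of j and l besides i
    q = len((adj[j] & adj[l]) - {i})
    # theta: 1 if j and l are themselves linked (never the case in a bipartite graph)
    theta = 1 if l in adj[j] else 0
    eta = 1 + q + theta
    denominator = (len(adj[j]) - eta) + (len(adj[l]) - eta) + q
    return q / denominator if denominator > 0 else 0.0


cc = pair_clustering(adj, "p1", "s1", "s2")
print(cc)
```

In the toy network, skills s1 and s2 share one further person (p2), while s1 has one additional tie (to p3) that does not close a square, so one of two possible squares is observed.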


The existing correlation of links helps to sustain collaboration between
methodological network members, which otherwise would not be able to function. If
persons $i$ and $k$ form links to common competencies $j$ and $l$, then efficient
cooperation between them is more likely.


3 Results 


The methodological knowledge network of this case study is simple, undirected,
unweighted, static, and structured as a bipartite graph, which consists of 117 vertices
connected by 595 edges. The degree of the staff who participated in the survey
ranges from 3 to 11, with a mean of 8.88, while the degree of the competency nodes
ranges from 0 to 39, with a mean of 10.2, which indicates that each competency from
the list has been indicated as well known by 10 respondents on average.

The degree sequence of competencies in statistics and econometrics indicates that most
methodological network members are familiar with Data Analysis and Time Series,
highly competent in Social Statistics and National Accounts, and experienced in the R
and SAS statistical analysis software. The biggest gaps within the methodological
network concern experts on Micro-data access and Statistical confidentiality, staff
knowledgeable in Transport and Energy statistics, and staff able to work with the Hadoop
tool. Other competencies from the defined list are more or less covered and known by
methodological network members.

The standard density measure gives a value of 0.17, which indicates a fairly sparse
network in which the average node has 17 per cent of its possible links.
However, in this particular case the standard denominator is clearly not appropriate
for describing methodological network members' competencies. Because respondents could
choose at most 11 dimensions out of 50, the standard measure cannot be interpreted as
the actual possible density. Using a modified denominator, the network obtains a density
of 0.79, which indicates a high competency level among methodological network members.

In the network studied, the clustering coefficient of the set of competency vertices is
not very high, at just above 20 per cent. A moderate correlation between clustering
coefficient and degree is detected.

Representing the data as a mathematical graph makes the general picture readily visible,
conveys its structure, and permits us to focus on persons, competencies and the relations
between them. We distinguish the two node sets by colour, so that nodes of the same type
have the same colour. The vertices of staff willing to collaborate are coloured in green,
blue, and red depending on the type of interest in involvement, while the set of
yellow vertices corresponds to 28 methodological areas, 12 statistical domains and
10 tools. The size of each label and vertex is proportional to its degree.

Figure 1: Organisational methodological knowledge network


To simplify visualisation, to analyse existing knowledge features more deeply and to
identify clusters of correlated areas more easily, the methodological network has
been divided into sub-networks by different breakdowns. Projections onto one-mode
networks, which capture weighted relations within the same set of vertices, have also
been obtained by multiplying the matrix $A$ by its transpose $A^{\top}$. Analysing the
sub-networks, we notice that density tends to increase when the average degree
decreases. The overlap in the structure of the nodes is very small, which indicates that
there is a large community of methodological network members with knowledge
and skills in varied combinations of areas.
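
The one-mode projections mentioned above can be sketched as follows. Multiplying the person-by-skill matrix by its transpose gives a person-by-person matrix whose off-diagonal entries count shared skills; multiplying in the other order gives a skill-by-skill matrix whose entries count shared persons. The toy matrix and the helper names are assumptions for illustration, not the data or code used in the study.

```python
# Hypothetical toy incidence matrix: 4 persons (rows) x 3 skills (columns).
A = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
]


def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def gram(M):
    """Return M times its transpose: entry (r, s) is the dot product of rows r and s."""
    return [[sum(a * b for a, b in zip(r, s)) for s in M] for r in M]


persons = gram(A)             # A A^T: weights are numbers of shared skills
skills = gram(transpose(A))   # A^T A: weights are numbers of shared persons

for row in persons:
    print(row)
```

The diagonal of each projection holds the degree of the corresponding vertex in the original two-mode network, so it is usually set aside when the projection is analysed as a weighted one-mode graph.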


4 Conclusions and discussion 


In this study we map and evaluate existing methodological skills within the statistical
office by applying network analysis techniques. Networks as analytical and visualisation
tools provide a number of useful outcomes. By detecting and then mapping
methodological skills within the organisation we are able to understand, spread, monitor
and maintain existing skills, to develop tools for better knowledge accessibility and
to modernise the ways in which information is diffused.


166 Agne Bikauskaite and Dario Buono 
The results obtained provide quantitative evidence that methodological network
members are qualified in different areas, and the measures reported suggest that
effective collaboration within the statistical office is possible. The network is highly
connected, and a significant competency gap is detected in only one methodological area
from the defined list of competencies of interest.

We can outline the importance of detecting and monitoring existing knowledge and
skills within a modern statistical office. Two employees can influence each other only
if they know about each other and about the competencies they have in common, since
efficient communication and collaboration within the organisation is possible only when
we know whom we could potentially contact. In addition, modernising the statistical
office's ways of working leads to a more cost-effective use of existing resources. The
network is a key resource for promoting and supporting the diffusion and expansion of
knowledge, enriching professional and personal skills and filling knowledge gaps within
the statistical office.


Big Data and Population Processes:
A Revolution? 


“Big Data” e processi di popolazione: una rivoluzione? 


Francesco C. Billari and Emilio Zagheni 


Abstract We first discuss the centrality of data paradigms in demography, documenting
their rise and fall over time, also making use of Google Books Ngram
Viewer. We then move on to discuss the ongoing “Data Revolution” in demography,
with a focus on emerging forms of big data access and on the use of digital
breadcrumbs.

Abstract Il contributo discute la centralità dei paradigmi basati su dati in demografia,
documentando il loro emergere e declino anche usando informazioni
derivate da Google Books Ngram Viewer. Successivamente si discute l’attuale data
revolution in demografia, focalizzando l’attenzione sulle forme emergenti di accesso
ai “big data” e sull’uso di “briciole di pane” digitali.


Key words: computational demography, Big Data, digital demography 


1 Four demographic data paradigms? 


In Kuhn’s well-known discussion of the role of paradigms and “normal science” in
scientific progress, the “normal” data to be used within a group of scholars are central
to a paradigm. Discussions and debates take place when paradigms
are challenged: “The pre-paradigm period, in particular, is regularly marked by
frequent and deep debates over legitimate methods, problems, and standards of
solution” [19]. Given the data-intensive nature of demography, we characterize
paradigms in demography by referring to the “normal” data used within a given
paradigm.

We now illustrate four data paradigms in demography¹. To illustrate the
rise and fall of data paradigms, we use an approach based on the prevalence of
combinations of terms (Ngrams) in books indexed in the Google corpus and accessible
through Google Books Ngram Viewer² [2].


1.1 Census and administrative records 


That the study of population processes needs “Big Data” should come as no surprise.
Indeed, data on population processes have always been “Big”, relative to
the epoch. It is useful to briefly recall here instances from the history of population
research, taking into account that, historically, governments, churches and local
authorities were the monopolists of data collection, curation and storage: census
and administrative records are the paradigmatic data in this first demographic data
paradigm.

In addition to Malthus’ ground-breaking work on the relationship between population
change and economic development, which has been linked to the emergence
of the modern population census, demographic research originates historically
from the creative and innovative use of data originally collected for other
purposes. Graunt’s 1662 Natural and Political Observations Made upon the Bills of
Mortality is considered the founding essay for demography, as well as for epidemiology
and statistics. The patient and pioneering analysis of the bills, which were
published weekly, together with the low technological level then available,
tells us that Graunt’s population data were already “Big”, relative to the epoch.

In historical demography, Henry and his colleagues pioneered the systematic
linkage of parish registers to reconstruct population processes. They started from a
single village, then extended this reconstruction to broader geographical areas, and
generalized the effort through the careful preparation and analysis of linked data [9].
The family reconstructions of Henry and colleagues were again already “Big Data” for
the epoch.

The use of population-wide individual-level register data has become the marker
and the comparative advantage of Nordic demographers, with later efforts to link
individual-level census records and other registers extending to other countries,
Belgium and the Netherlands in particular. Register data with individual
identifiers (PINs, i.e. Personal Identification Numbers) that make it possible to link
multiple individuals and to follow individuals over time are “Big Data”. Systems of
PINs were implemented in Sweden in 1947, in Norway in 1961, in Finland in 1964, and
in Denmark in 1968 [22]. This led, after some time, to the abolition of population
censuses, with register-based “Big Data” also replacing the census for population
counts.

¹ This section is loosely inspired by Gray’s idea of a fourth paradigm [15]. Our characterization
of four paradigms in demography differs from that of Courgeau and Frank, who refer to the
micro-macro perspective through which relationships between the individual and population levels
have been seen over time [6].

² https://books.google.com/ngrams. Figures 1, 2 and 3 were generated using Google
Books Ngram Viewer with the English language 2012 corpus, using data between 1800 and 2008.
The robustness of results was checked against case sensitivity. Héran [14] carried out a detailed
analysis of the “demographic vocabulary” using the same approach.




Fig. 1 The rise and fall of the “Census” in English books indexed in Google books. Source: 
https://books.google.com/ngrams 


The demographic data paradigm based on census and administrative records is
interested only in macro-level outcomes. Even when individual-level data are used
as the starting point, the main interest is to quantify population-level parameters.
Formal demography has emerged, developing the mathematical basis of the measurement
of population-level quantities and of the study of population dynamics, to
complement and to inspire data analyses. To quantify the rise (and fall) of this
paradigm, we consider the prevalence of the term “census” in books, depicted in
Figure 1, where the peak is in the mid-1970s³.


1.2 Theory-driven micro-level data 


After World War II, sample surveys became widespread in the study of population
processes, following up on earlier developments during the 1930s. During this period,
“Demographers at the Bureau of Census, in collaboration with applied statisticians,
began to develop sampling methods for meeting demands for timely measures of
unemployment levels” [29]. By the end of the 1950s, at least in the United States,
sample surveys had become central in social science research, including population
research. By the end of the 1960s statistical packages had become available to
analyze what were the “Big Data” of that epoch. Theory-driven micro-level data are
the paradigmatic ones in this second demographic data paradigm.

³ A similar trend over time could be found when restricting the search to “population census”.

The World Fertility Survey (WFS), coordinated by the London office of the International
Statistical Institute (ISI), was the first major attempt to field a comparable
sample survey across a wide range of countries, with 61 countries fielding a WFS
between 1973 and 1984. At the outset of the programme, Sprehe, on behalf of the ISI,
stated the basic aim of the WFS: “to provide scientific information that will permit
each participating country to describe and interpret its population’s fertility” [32].
All WFS surveys had to include “independent variables” that measure factors affecting
fertility, to be included in micro-level statistical analyses. The choice of these factors
was based on existing fertility theories, and built on earlier survey-based research.

Since the WFS, sample surveys became a prime, if not the major, source
for population research during the last quarter of the twentieth century, in particular
for family and fertility research. Demographers have also engineered,
through formal demography, ways to exploit limited and defective data in order to
estimate population-level parameters from information available through, for instance,
the Demographic and Health Surveys (DHS), the successor of the WFS for
developing countries. This “demographic estimation” approach has recently been
systematically illustrated by Moultrie and colleagues [24].

While traditional formal demography remains significantly anchored at the macro level
within this paradigm, there is a parallel development in which micro-level outcomes
become the target of demographic research. The emergence of micro-level
data, and of micro-level outcomes, as a central target in the study of population
processes, and therefore of the second data paradigm in demography, is linked to
interest in how micro- and meso-level factors influence demographic choices. Statistics
comes to support this micro-level focus, and the 1972 article by David Cox [7] provides
an elegant and general regression-based approach to life tables, three centuries
after Graunt.

To quantify the rise (and fall) of this second paradigm, we look at two trends.
First, Figure 2 compares the prevalence of the ngram “life table” or “life tables” with
that of “proportional hazards”. By the early 2000s “proportional hazards” had essentially
reached the frequency of “life table” and seems to have started its decline. Second,
Figure 3 shows the rise and fall of the WFS as compared to the DHS and the
Fertility and Family Survey between 1970 and 2008.


1.3 Data-driven discovery meets theory-driven discovery 


The key critique of the second demographic data paradigm is that it leads researchers
to forget population-level processes, the ultimate object of population scholars [21].
The idea that a multi-level paradigm should replace the micro-based one has been
suggested, among others, by Courgeau and Frank [6].

Fig. 2 Ngram prevalence for “life table” and “proportional hazards” in English books indexed
in Google books. Source: https://books.google.com/ngrams

Fig. 3 Ngram prevalence for “World Fertility Survey”, “Demographic and Health Survey”, and
“Fertility and Family Survey” in English books indexed in Google books. Source:
https://books.google.com/ngrams

One way to see the link between demography targeting macro- and micro-level
outcomes, as well as the link between data and theory in population research, is
to cast a two-stage view of demographic research, distinguishing a discovery stage
from an explanation stage [5]. The first (discovery) stage aims at the production of
novel evidence at the population level. While description is often abhorred in the
social sciences, research on population processes has shown that it is fundamental to
anchor science to solid empirical bases. However, only novel evidence contributes
to the cumulation of knowledge. The second (explanation) stage aims at developing
accounts of demographic change and tests how the action and interaction of
individuals generate what is discovered in the first stage.

The distinction between discovery and explanation does not imply that
discovery on population processes should only be data-driven. However, discovery
should not only be theory-driven either, as for instance advocated by some social
theorists [35] who do not give empirical discoveries a proper role in social science.
The meeting between data-driven and theory-driven discovery has emerged in population
research more recently, with a marriage between demography’s “powerful
descriptive potential” and causal analysis [25]. A data example of this meeting is the
effort to link administrative records with theory-based survey data. The Generations
and Gender Survey, for instance, has exploited the administrative records available in
Nordic countries and linked them with the theory-based questionnaire [36].

In terms of methods, the key challenge for this third demographic data paradigm
has been to link micro-level (data and theory) processes with macro-level population
processes. The spread of computational, agent-based modeling has been seen as a
potential solution. It is, however, too early to say whether this approach has made it to
the core of a paradigm [2]. The systematic approach linking data- and theory-driven
micro-founded simulation models with data at the population level has, however,
become visible in demographic journals [3][17].


1.4 A fourth paradigm? 


Are we at the dawn of a fourth demographic data paradigm? Yes and no: we are 
observing debates and trends that are typical of pre-paradigm shifts. The outcome is 
yet to be determined, but we identify three key crossroads. 

Centralization vs Decentralization. Data collection and data interpretation, the
essence of research, have so far been in the hands of a small minority of experts.
The Internet and the digital revolution have marked a discontinuity in research
practices. Everyone can potentially collect data for their own use or for research
purposes, in various forms that include collecting genealogical family trees or using a
mobile phone app to monitor health. The process is completely decentralized. However,
corporations have emerged to tap into these new sources and bring them to
a centralized repository. Similarly, the “open science” movement has brought
non-professionals into the realm of research. Wikipedia is an example of a revolutionary
form of mass collaboration that goes beyond professional scientists to produce
knowledge.

Bias vs Variance. Against the backdrop of decreasing survey response rates and
the increasing availability of non-representative data, we are observing a strong
interest in developing rigorous techniques to make sound inference from biased,
non-probabilistic samples. For example, Wang et al. (2015) [37] showed that it is feasible
to forecast elections using data from surveys run on the Xbox gaming platform, if an
appropriate approach that involves post-stratification is used.
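
To make the idea of post-stratification concrete, the sketch below re-weights a deliberately unbalanced sample by known population shares. It shows only classical post-stratification on a single variable; it is a simplified stand-in and not the multilevel procedure used in the cited study. The sample, the population shares and the function name `poststratify` are invented for illustration.

```python
def poststratify(sample, population_shares):
    """Weight the cell means of a biased sample by known population shares."""
    totals, counts = {}, {}
    for cell, outcome in sample:
        totals[cell] = totals.get(cell, 0) + outcome
        counts[cell] = counts.get(cell, 0) + 1
    # A cell that is absent from the sample raises KeyError: plain
    # post-stratification cannot estimate it without an additional model.
    return sum(share * totals[cell] / counts[cell]
               for cell, share in population_shares.items())


# Hypothetical sample that over-represents young respondents (1 = supports candidate).
sample = [("young", 1)] * 6 + [("young", 0)] * 2 + [("old", 1)] + [("old", 0)]
# Hypothetical population composition taken from a census or register.
population_shares = {"young": 0.3, "old": 0.7}

raw_mean = sum(outcome for _, outcome in sample) / len(sample)
adjusted = poststratify(sample, population_shares)
print(raw_mean, adjusted)
```

The raw mean reflects the composition of the sample, while the adjusted estimate reflects the composition of the target population. The adjustment removes bias only to the extent that respondents within each cell resemble non-respondents in the same cell.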

Re-purposing data vs Re-purposing methods. Although data have always been
collected for goals other than research, today the sheer scale of data that are
available to anyone is so large that it is driving new directions of research.
Re-purposing data might become the norm in the social sciences and it may lead to the
development of new methods, as well as to the re-purposing of classical approaches for
the new “Big Data” context.


2 Here comes the Data Revolution


The potential emergence of a fourth paradigm has not gone completely unnoticed
within the demographic research community. The International Union for the Scientific
Study of Population (IUSSP) has joined the movement initiated by the United
Nations towards a Data Revolution, i.e. a “new international initiative to improve the
quality of statistics and information available to citizens” [1]⁴. We briefly address
two aspects of this Data Revolution: “new” old Big Data and the so-called digital
breadcrumbs.


2.1 “New” old Big Data 


Ruggles [30] describes an “explosion” in the availability of “Big Microdata” for 
population research. The approach pushed by Ruggles and his colleagues at IPUMS, 
with a strong basis at the University of Minnesota, is to make micro-level population 
data from censuses and other sources as accessible as possible for other researchers. 
These data should offer unprecedented opportunities for population research over
time and place, with rich geographical detail.

Other examples of how old big demographic data can be used in a new way
rely on innovative approaches, including crowd-sourcing, to extract information from
paper-based demographic documents, including hand-written ones [11]. In this case,
the meeting between demographers and computer scientists has been fruitful and
promises to deliver a wide range of (big) micro data from several
sources.


2.2 Digital breadcrumbs 


2.2.1 Re-purposing data 


The global spread of the Internet and digital technologies, as well as the rapid
diffusion of smartphones, have profoundly transformed our lives. As a consequence of
the digital revolution, individuals leave an increasing quantity of traces online that
can be analyzed to advance knowledge on population processes. Here we include
examples of research that leveraged online data to study the three main components
of demographic change: fertility, mortality and migration.

Fertility. Web searches represent the main online data source that has been used to
study fertility. Reis and Brownstein (2010) show that the volume of Internet searches
for abortion is inversely proportional to local abortion rates and directly proportional
to local restrictions on abortion [28]. Billari et al. (2013) show that Google searches
for fertility-related queries, like ‘pregnancy’ or ‘birth’, can be used to predict
fertility intentions and fertility rates several months ahead [4]. Ojala et al. (2017) use
Google Correlate to detect evidence for different socio-economic contexts related
to fertility (e.g., teen fertility, fertility of high income households, etc.) [26]. One of
the most important messages of this line of literature is that combining traditional
data sources with new data, like Web searches, can improve the predictive power of
demographic models. However, that cannot be done in a naive way, as correlations
between aggregate Web searches and individual intentions may not persist for long
periods of time. For example, the widely known Google Flu approach to tracking
influenza symptoms and detecting potential outbreaks using Web searches [12] has been
very useful and successful. However, at times, it also produced largely erroneous
estimates, typically when the nature of the relationship between searches, news and
behaviors changed [20]. Thus the results of these models have to be interpreted
with caution.

⁴ The two authors of this paper are also co-chairing the IUSSP Scientific Panel on “Big Data and
Population Processes” established for the 2015-18 period.

Mortality. In the context of mortality analysis, the main source of online data 
results from decentralized collaborations that have produced genealogical data sets. 
For example, Fire and Elovici [8] use data collected from the WikiTree website 
to study correlations in lifespans among parents and children, as well as spouses. 
Similarly, Kaplanis et al. (2017) [16] leverage the data produced by enthusiasts of 
genealogy to evaluate population genetics theories on the dispersion of families. The 
key here is that (a) there are digital records that are left behind by people or institu- 
tions and (b) there is a critical mass of people who organize the data in meaningful 
ways for their own purposes and common goals. 

Migration. Trends in international migrant flows have been estimated by tracking
the locations, inferred from IP addresses, of users who repeatedly log in to a
Web service (e.g., Yahoo! [39, 34]). Geo-located Twitter tweets have been used to
integrate the dimensions of internal and international migration [38] and to study
global mobility patterns [13]. LinkedIn data have proven useful to evaluate trends in
migration by educational attainment and sector of employment [33]. Google+ data,
which provide pseudo migration histories, have proven useful to study how migrants
connect countries within a network of flows [23].


2.2.2 Re-purposing methods 


Data science is about data, including re-purposing data. However, above all it is 
about the scientific use of data to advance knowledge. In this section we include a 
couple of examples of applications of classic social science methods and research 
design to the new data environment. 

Demographic calibration. Non-representative digital breadcrumbs have to be
calibrated against ‘ground truth’ data in order to evaluate biases and model them.
Zagheni and Weber develop a method that combines the parsimonious perspective
of model life tables, based on level and shape parameters, with standard calibration
techniques [39]. The approach is inspired by calibration models for stochastic
microsimulations [31]. The underlying idea in the microsimulation literature is that
simulations may generate estimates of quantities of interest that are biased. Identifying
and modeling the bias is thus key to making statistical inference. We can consider
social media and the Internet as “laboratories” that produce estimates of quantities
of interest that are biased, but in a systematic way. Here, “systematic” means that
there are hidden, potentially stochastic rules that determine the relationship between
the online data and the offline quantities of interest. Conditional on a model for the
bias, statistical inference for the quantities of interest can be made using techniques
like Bayesian melding [27, 31].

Difference-in-differences. In some situations, “ground truth” data do not exist.
Without any knowledge about the size and the direction of the bias, providing a
reliable picture of the quantity of interest at one point in time is not possible. In
these cases, instead of estimating the absolute value of variables of interest, a more
modest task can be accomplished: estimating relative changes in quantities. This
can be done using a difference-in-differences approach. A first demographic example
is the estimation of trends in migration patterns using geo-located Twitter data [38].
A second type of application relates to the evaluation of how shocks, like
anti-immigrant laws, shape public sentiments about migration [10].
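
The logic of the difference-in-differences approach can be illustrated with a small numerical sketch. It assumes that the online measure equals the true quantity plus a group-specific bias that does not change over time, so the bias cancels when changes are compared. All numbers, the bias values and the function name `diff_in_diff` are hypothetical and chosen only to show the cancellation.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the group of interest minus the change in the reference group."""
    return (treated_after - treated_before) - (control_after - control_before)


# Hypothetical true rates (before, after) for two groups.
true_treated = (0.10, 0.16)
true_control = (0.10, 0.12)

# Assumed time-constant, group-specific biases of the online data source.
bias_treated, bias_control = 0.05, -0.02

# What the online data would show under this assumption.
obs_treated = tuple(v + bias_treated for v in true_treated)
obs_control = tuple(v + bias_control for v in true_control)

effect_true = diff_in_diff(*true_treated, *true_control)
effect_obs = diff_in_diff(*obs_treated, *obs_control)
print(effect_true, effect_obs)
```

The levels in the observed data are wrong in both groups, yet the estimated relative change matches the one computed from the true rates. If the bias drifts over time, for example because the user base of a platform changes, the cancellation no longer holds.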


2.2.3 Can formal demography make a comeback? 


Can digital breadcrumbs offer new opportunities for formal demography? Online 
users form populations that can be analyzed using classic tools of formal demo- 
graphic analysis. In turn, new types of population dynamics generate new questions 
that require new ways of formalization. Here we offer an illustrative example. 

Consider users of a social media platform, like Twitter. The date when customers 
sign up for the service can be interpreted as a birth. The date when they stop tweeting 
for a long-enough interval of time can be interpreted as a death. 

Figure 4 shows the age structure of a sample of active Twitter users (mid-2016). 
The histogram, which is equivalent to a population pyramid, reveals an age struc- 
ture tilted towards ‘young’ users. In other words, it is a population that is growing 
rapidly. With these data only, we cannot say whether the growth in the population 
of Twitter users is driven by bots or real users, or whether it is related to ‘life course 
transitions’ of users, who may be quite active when they sign up, but then stop 
tweeting after a certain interval of time. However, we can use standard demographic 
techniques to estimate an approximate rate of growth in Twitter customers. 

The problem can be stated as follows: given the number of individuals $P_x$ at age
$x$ and $P_y$ at age $y$, at time $t$, the goal is to find the rate at which the births were
increasing between years $t - x$ and $t - y$. It turns out that, under the assumption of
exponential growth of births, the population rate of growth $r$ is (see Keyfitz and
Caswell [18]):

$$r = \frac{1}{y - x} \log\left( \frac{P_x / L_x}{P_y / L_y} \right) \qquad (1)$$

where $L_x$ and $L_y$ are the fractions of people surviving $x$ and $y$ years, respectively.

For the illustrative example based on a small sample of Twitter users, one obtains
an estimated annual growth rate of the Twitter population equal to around 0.3. This
is extremely fast growth, compared to rates for human populations.
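
Equation (1) can be evaluated directly. The sketch below checks it on synthetic data in which births grow exponentially at a known rate, so the estimate should recover that rate. The survival fractions, ages and growth rate are invented for the check and are not the Twitter sample analysed above; the function name `growth_rate` is likewise hypothetical.

```python
import math


def growth_rate(P_x, P_y, x, y, L_x=1.0, L_y=1.0):
    """Rate of growth of births implied by the counts at two ages, as in equation (1)."""
    return math.log((P_x / L_x) / (P_y / L_y)) / (y - x)


# Synthetic population: births B(s) = exp(r_true * s); survivors to age a are B(t - a) * L_a.
r_true = 0.2
t, x, y = 10, 1, 4
L_x, L_y = 0.9, 0.6          # assumed fractions surviving x and y years
P_x = math.exp(r_true * (t - x)) * L_x
P_y = math.exp(r_true * (t - y)) * L_y

r_hat = growth_rate(P_x, P_y, x, y, L_x, L_y)
print(r_hat)
```

Dividing each count by its survival fraction recovers the size of the corresponding birth cohort, so the log ratio measures how much births grew over the $y - x$ years separating the two cohorts.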


3 Not a conclusion: The Data Revolution is not a dinner party 


If we believe that a paradigm shift in the study of population processes, around the
emergence of “Big Data”, is under way, it is by definition impossible to draw firm
conclusions. For sure, this “Data Revolution” will not be a dinner party.

Conventional wisdom will need to be challenged. Existing borders between disciplines
might become a hindrance to scientific progress. Sticking to traditional
approaches within the demographic research community might prevent further
progress, or just let other, bolder, communities of scholars bring the advances
needed to further our understanding of population processes. These challenges will
need to be accompanied by new types of training for the younger generations of
scholars and, perhaps even more relevantly, for the older generations. A fruitful
way ahead is perhaps to combine traditional approaches with new ones: counting
and now-casting, indirect estimation and the use of non-representative Web-based
data, official statistics and digital breadcrumbs.

A bit of patience, despite the speed of the field, is needed. Setbacks will happen
and mistakes will be made within the “Data Revolution”. Trial and error is needed.
Taking a very conservative stance that requires a new paradigm to have fully shown
its potential in order to legitimize its approaches would, however, be an even bigger
mistake. For the study of population processes, the Data Revolution is already here.

Fig. 4 Age distribution of a sample of active Twitter users (mid-2016), where ‘birth’ indicates the
date when the user signed up. Source: own elaboration of data collected using the Twitter API.


Bayesian Tensor Regression models


Monica Billio and Roberto Casarin and Matteo Iacopini 


Abstract In this paper we introduce the literature on regression models with tensor
variables and present a Bayesian linear model for inference, under the assumption
of sparsity of the tensor coefficient. We exploit the CANDECOMP/PARAFAC (CP)
representation for the tensor of coefficients in order to reduce the number of
parameters and adopt a suitable hierarchical shrinkage prior for inducing sparsity. We
propose an MCMC procedure via Gibbs sampler for carrying out the estimation,
discussing the issues related to the initialisation of the vectors of parameters involved
in the CP representation.


Key words: Tensor regression, Sparsity, Bayesian Inference, Hierarchical Shrink- 
age Prior 


1 Introduction 


The increasing availability of large sets of data presented in different formats (the
most general class of examples includes all data that come as images, such as EEG
or the outcome of many other medical tests, video recordings and so on) has brought
to light some limitations of the existing multivariate econometric models. In the era of
the so-called “Big Data”, the traditional mathematical representation of information
in terms of matrices has some non-negligible drawbacks, the most remarkable
of them being the difficulty of accounting for the structure within the data. When
the information is available in the form of a collection of matrices, or in higher-order
structures, such as tensors (e.g. in text or image processing), one approach to
inference relies on vectorizing the object of interest by stacking all the elements in a
column vector, which is then studied by means of multivariate analysis techniques.
Though this approach is well established in the literature, it is suited to low-dimensional
and unstructured arrays and is not advisable in higher dimensions. By
stacking all the elements of the object of interest in a long unidimensional array,
we lose the structural information encoded in the original shape of the variable.
In other words, the physical features of the data matter, since the value contained,
for example, in a cell of a matrix is highly likely to depend on the values of a
subset of the whole matrix; the process of vectorization, however, does not
preserve this kind of information. Thus the introduction of novel methods able to
treat 2-dimensional or multidimensional data as they are, that is, without modifying
their shape by vectorization, is still an open and challenging question in statistics and
econometrics.

Matrix models in econometrics have been employed over the past decade, espe- 
cially in time series analysis where they have been widely used for the state space 
representation of these models. However, it is only recently that the attention of the
academic community has moved towards the study of matrix models. Continuing 
the stream of literature on time series models, [5] utilized these tools for studying 
dynamic linear models. Other fields of application include the analysis of Gaussian 
graphical models and the classification of longitudinal datasets. 

A different stream of literature concentrates on tensor regression models and can
be divided into two categories, according to the specification of the model. First of
all, linear models as in [7] and [6] generally include in a regression function the
scalar product between a tensor $\mathcal{X} \in \mathbb{R}^{d_1 \times \dots \times d_D}$ and a tensor of coefficients
$\mathcal{B} \in \mathbb{R}^{d_1 \times \dots \times d_D}$. In more detail, [6] proposes a multivariate model with a tensor covariate
for longitudinal data analysis, whereas [7] uses a generalized linear model with
exponential link and tensor covariate for analysing image data.

Following a different purpose, [3] generalizes the univariate and multivariate
regression by allowing both the response and the covariate to be tensor-valued. The
author exploits the Tucker product, which was originally developed as a tensor
representation method, and then follows the Bayesian approach for the estimation. From a
frequentist perspective, the literature is still limited, partly due to the highly complex 
optimisation problems involved in the estimation process, which generally relies on 
iterative maximum likelihood procedures. 

Motivated by the need for new methodologies able to deal directly with two- or
higher-dimensional variables, we propose a new linear regression modelling framework
well suited to data that are available in the shape of tensors, as far as both the
response variable and the covariate are concerned. The general model we propose
is shown to encompass both univariate and multivariate regression as special cases.
Furthermore, we address the issue of dimensionality by exploiting a suitable
parametrization which both achieves parameter parsimony and incorporates
sparsity in the coefficients. As regards inference, the Bayesian approach
is very appealing in this framework, as it allows the necessary flexibility while
retaining analytical and computational tractability. We therefore adopt this perspective
and provide a Markov chain Monte Carlo (MCMC) procedure for carrying out the
estimation.

The main contribution of this paper is to provide a unifying framework for ex- 
isting econometric models, which generalises to higher-dimensional and structured 
variables. From a computational perspective, we focus on the issues related to the 
initialization of the Gibbs sampler for the vectors of parameters involved in the 
CANDECOMP/PARAFAC (CP) representation of the tensor of regression coefficients.

The remainder of the paper is organised as follows: in Section 2 we present the model
and briefly discuss its most relevant characteristics. The inferential approach is then 
outlined in Section 3 and the results of the estimation process based on a simulated 
dataset are given in Section 4. Finally, we draw conclusions and give an outline of 
current research in Section 5. 


2 Bayesian Tensor Regression Model 


Define a tensor as a generalisation of a matrix into a D-dimensional space, namely
$\mathcal{X} \in \mathbb{R}^{d_1 \times \dots \times d_D}$, where $D$ is the order of the tensor and $d_j$ is the length of dimension
$j$. Clearly, matrices, vectors and scalars are particular cases of tensor variables,
of order 2, 1 and 0, respectively. The common operations defined on matrices and
vectors in linear algebra can also be applied to tensors (henceforth understood to be
of order $\geq 3$), via slight generalisations of their definitions. Moreover, other operators
and representations can be defined on tensors which are not defined on
lower-dimensional objects. For a remarkable survey on this subject, see [4].

The general tensor linear regression model (see [1] for further details) we present
here can manage covariates and response variables in the form of vectors, matrices 
or tensors. It is given by: 


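$$\mathcal{Y}_t = \mathcal{A} + \mathcal{B} \times_{D+1} \mathrm{vec}(\mathcal{X}_t) + \mathcal{C} \times_{D+1} z_t + \mathcal{D} \times_n W_t + \mathcal{E}_t, \qquad \mathcal{E}_t \overset{iid}{\sim} \mathcal{N}_{d_1,\dots,d_D}(0, \Sigma_1, \dots, \Sigma_D) \qquad (1)$$

where the tensor response and errors are given by $\mathcal{Y}_t, \mathcal{E}_t \in \mathbb{R}^{d_1 \times \dots \times d_D}$, while the
covariates are $\mathcal{X}_t \in \mathbb{R}^{d_1^X \times \dots \times d_M^X}$, $W_t \in \mathbb{R}^{d_n \times d_2^W}$ and $z_t \in \mathbb{R}^{d_z}$. The coefficients are
$\mathcal{A} \in \mathbb{R}^{d_1 \times \dots \times d_D}$, $\mathcal{B} \in \mathbb{R}^{d_1 \times \dots \times d_D \times p}$, $\mathcal{C} \in \mathbb{R}^{d_1 \times \dots \times d_D \times d_z}$ and
$\mathcal{D} \in \mathbb{R}^{d_1 \times \dots \times d_{n-1} \times d_2^W \times d_{n+1} \times \dots \times d_D}$,
where $p = \prod_i d_i^X$. The symbol $\times_n$ stands for the mode-$n$ product between a tensor
and a vector, as defined in [4].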
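To make the mode-n product between a tensor and a vector concrete, the following minimal sketch contracts the third mode of an order-3 tensor with a vector. The nested-list layout, the function name `mode3_product` and the toy numbers are assumptions made for this illustration only; the general definition is the one given in [4].

```python
def mode3_product(tensor, vector):
    """Contract the third mode of a (d1 x d2 x p) tensor with a length-p vector.

    Returns the d1 x d2 matrix with entries sum_k tensor[i][j][k] * vector[k].
    """
    if any(len(fibre) != len(vector) for row in tensor for fibre in row):
        raise ValueError("third-mode length must match the vector length")
    return [[sum(b * x for b, x in zip(fibre, vector)) for fibre in row]
            for row in tensor]


# Toy 2 x 2 x 3 coefficient tensor and a vectorised covariate of length 3
# (hypothetical values, chosen only to keep the arithmetic easy to follow).
B = [[[1, 0, 2], [0, 1, 0]],
     [[3, 0, 0], [1, 1, 1]]]
x = [1, 2, 3]
result = mode3_product(B, x)
```

Each entry of `result` is the inner product of one mode-3 fibre of the tensor with the vector, so the output has one order less than the input tensor.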

Notice that this model provides a generalization of several well-known econo- 
metric linear models, among which univariate and multivariate regression, VAR, 
SUR and Panel VAR models and matrix regression model (see [1] for formal 
proofs). 


We focus on the particular case where both the regressor and the response variables
are square matrices of size $k \times k$ and the error term is assumed to be distributed
according to a matrix Normal distribution:

$$Y_t = \mathcal{B} \times_3 \mathrm{vec}(X_t) + E_t, \qquad E_t \overset{iid}{\sim} \mathcal{N}_{k,k}(0, \Sigma_c, \Sigma_r). \qquad (2)$$

In this case the coefficient is a three-dimensional tensor $\mathcal{B} \in \mathbb{R}^{k \times k \times k^2}$. Since the
model is overparametrised, in order to provide a significant reduction of the number
of parameters we assume a CANDECOMP/PARAFAC (CP) representation (more
details in [4] and [1]) for the tensor, as follows:


$$\mathcal{B} = \sum_{r=1}^{R} \mathcal{B}_r = \sum_{r=1}^{R} \beta_1^{(r)} \circ \dots \circ \beta_D^{(r)}, \qquad (3)$$

where the vectors $\beta_j^{(r)} \in \mathbb{R}^{d_j}$, $j = 1, \dots, D$, are also called margins of the CP
representation and $R$ is the CP-rank of the tensor. Since the estimation of $R$ is an NP-hard
problem, in the following we assume a fixed value for it. The CP decomposition
permits a significant reduction of the number of parameters of the tensor of
coefficients. For a tensor of order $D$ with length $d_i$ of dimension $i$, it decreases from
$\prod_{i=1}^{D} d_i$ to $R \sum_{i=1}^{D} d_i$. The corresponding gain we obtain by making this assumption
in model (2) consists in a reduction of the complexity from $O(k^4)$ to $O(k^2(R+1))$.
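The parameter saving of the CP representation can be checked with a few lines of code. The sketch below builds an order-3 tensor as a sum of rank-one outer products of its margins and compares the number of free entries of the full tensor with the number of margin entries. The function names and the rank `R = 5` are hypothetical choices for illustration; only `k = 10` is taken from the simulation setting of the paper.

```python
from math import prod


def outer3(a, b, c):
    """Outer product a o b o c as a nested list of shape len(a) x len(b) x len(c)."""
    return [[[ai * bj * ck for ck in c] for bj in b] for ai in a]


def cp_tensor(margins):
    """Sum of R rank-one terms; margins is a list of (beta1, beta2, beta3) triples."""
    d1, d2, d3 = (len(v) for v in margins[0])
    total = [[[0.0] * d3 for _ in range(d2)] for _ in range(d1)]
    for b1, b2, b3 in margins:
        term = outer3(b1, b2, b3)
        for i in range(d1):
            for j in range(d2):
                for k in range(d3):
                    total[i][j][k] += term[i][j][k]
    return total


def parameter_counts(dims, rank):
    """Entries of the full tensor versus the total number of margin entries."""
    return prod(dims), rank * sum(dims)


# Model (2) with k = 10: the coefficient tensor has shape k x k x k^2.
# R = 5 is an assumed rank, used here only for illustration.
k, R = 10, 5
full, cp = parameter_counts((k, k, k * k), R)
```

With these values `full` is 10000 and `cp` is 600, which shows the order of magnitude of the reduction obtained by working with the margins instead of the full tensor.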


3 Bayesian Inference 


We follow the Bayesian approach for inference, thus we need to specify a prior
distribution for all the parameters of the model. The adoption of the CP representation
for the tensor of coefficients is crucial from this point of view, as it reduces
the problem of specifying a prior distribution on a multi-dimensional tensor,
for which very few possibilities are available in the literature, to the standard
multivariate case. In fact, it suffices to define a prior distribution for all the margins: this
can be done in a very flexible way by using multivariate distributions. As a
consequence, we can embed the prior knowledge of sparsity of the coefficient
through the choice of a suitable hierarchical shrinkage prior.


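Building on [2], we define a prior for each of the margins $\beta_j^{(r)}$ of the tensor
coefficient $\mathcal{B}$ by means of the following hierarchy:


$$\pi(\beta_j^{(r)} \mid W, \phi, \tau) \sim \mathcal{N}_{d_j}(0, \tau \phi_r W_{j,r}) \qquad \forall r = 1, \dots, R \quad \forall j = 1, 2, 3 \qquad (4)$$
$$\pi(w_{p,j,r}) \sim \mathcal{E}xp(\lambda_{j,r}^2 / 2) \qquad \forall r = 1, \dots, R \quad \forall j = 1, 2, 3 \quad \forall p = 1, \dots, d_j \qquad (5)$$
$$\pi(\phi) \sim \mathcal{D}ir(\alpha, \dots, \alpha), \qquad \phi = (\phi_1, \dots, \phi_R) \qquad (6)$$
$$\pi(\tau) \sim \mathcal{G}a(a_\tau, b_\tau) \qquad (7)$$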


The idea behind this prior construction is to induce sparsity of the tensor $\mathcal{B}$ in
a flexible way via the introduction of hyperparameters accounting for the different
levels. The component $W_{j,r}$ is a $d_j \times d_j$ diagonal matrix whose entries $(w_{p,j,r})_p$
represent the individual (local) share of the variance, whereas $\phi_r$ introduces sparsity
at an intermediate level by shrinking all the $r$-th margins of the CP representation in
eq. (3). Finally, $\tau$ provides a global control on the variance, common to all the
vectors.
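The three levels of shrinkage (global, intermediate and local) can be illustrated by drawing the margins from a hierarchy of this kind. The sketch below is only an illustration under stated assumptions: the Gamma prior is parametrised by shape and rate, the local variances are exponential with rate `lam**2 / 2`, and all hyperparameter values and the rank are hypothetical defaults rather than the settings used in [1].

```python
import random


def sample_margins(dims, rank, alpha=1.0, a_tau=1.0, b_tau=1.0, lam=1.0, seed=0):
    """Draw all margins beta_j^(r) from a global-local shrinkage hierarchy.

    Assumed for illustration: tau ~ Gamma(shape=a_tau, rate=b_tau),
    phi ~ Dirichlet(alpha, ..., alpha), w ~ Exponential(rate=lam**2 / 2),
    beta_j^(r) | . ~ N(0, tau * phi_r * diag(w)).
    """
    rng = random.Random(seed)
    tau = rng.gammavariate(a_tau, 1.0 / b_tau)            # global variance (scale = 1 / rate)
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(rank)]
    phi = [g / sum(gammas) for g in gammas]                # Dirichlet via normalised gammas
    margins = []
    for r in range(rank):
        row = []
        for d in dims:
            # Local variances, one per entry of the margin.
            w = [rng.expovariate(lam ** 2 / 2.0) for _ in range(d)]
            # Independent Gaussian entries with variance tau * phi_r * w_p.
            row.append([rng.gauss(0.0, (tau * phi[r] * w_p) ** 0.5) for w_p in w])
        margins.append(row)
    return tau, phi, margins


# Margins for a k x k x k^2 tensor with k = 10 and an assumed rank of 3.
tau, phi, margins = sample_margins(dims=(10, 10, 100), rank=3)
```

A small `phi[r]` shrinks all three margins of the r-th CP component at once, while a small local variance shrinks a single entry, which is the mechanism described in the text.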

We complete the prior specification by assuming Inverse Wishart prior
distributions for the two covariance matrices of the error term:


$$\pi(\Sigma_r) \sim \mathcal{IW}_k(\nu_r, \Psi_r) \qquad (8)$$
$$\pi(\Sigma_c) \sim \mathcal{IW}_k(\nu_c, \Psi_c) \qquad (9)$$


In compact notation, the joint prior distribution is given by: 


$$\pi(\theta) = \pi(\beta \mid W, \phi, \tau)\, \pi(W)\, \pi(\phi)\, \pi(\tau)\, \pi(\Sigma_c)\, \pi(\Sigma_r). \qquad (10)$$


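Given a sample $(Y, X) = \{Y_t, X_t\}_{t=1}^{T}$ and defining $x_t = \mathrm{vec}(X_t)$, the likelihood
function of the model (2) is given by:


$$L(Y \mid \mathcal{B}, \Sigma_c, \Sigma_r) = \prod_{t=1}^{T} (2\pi)^{-\frac{k^2}{2}} |\Sigma_c|^{-\frac{k}{2}} |\Sigma_r|^{-\frac{k}{2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \Sigma_c^{-1} (Y_t - \mathcal{B} \times_3 x_t)' \Sigma_r^{-1} (Y_t - \mathcal{B} \times_3 x_t) \right] \right\}. \qquad (11)$$
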
The details of the Gibbs sampler along with the analytical derivation of the full 
conditionals are given in [1]. 


4 Simulation 


The model poses a problem for the initialisation of the vectors $\{\beta_j^{(r)}\}_{j,r}$, since there
is no guidance on what a suitable starting value for the Gibbs sampler could be
(and the sampler is known to be sensitive to the initial point). A naive way to initialise the
vectors is to use an accept/reject algorithm based on a proposal corresponding to the
prior distribution. However, this approach has been found to converge slowly due to
the low acceptance rate of good starting values. Instead, we address this issue by
initialising the vectors $\{\beta_j^{(r)}\}_{j,r}$ with the outcome of a simulated annealing algorithm.
This is a stochastic optimisation algorithm close in spirit to the Metropolis-Hastings
algorithm: through the choice of a tempering process, at the initial iterations it makes big
moves on the domain, allowing exploration of the parameter space, while at
successive steps the reduction of the temperature contracts the range where optima are
searched for, until convergence. With a suitable tuning of the tempering scheme, this
algorithm is able to provide the global optimum. In practice, it delivers good starting
points for the Gibbs sampler in a short computing time.
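The behaviour described above (wide exploratory moves at high temperature, local refinement as the temperature decreases) can be sketched with a generic simulated annealing routine. This is not the implementation used in [1]: the geometric cooling schedule, the Gaussian proposal and the toy quadratic objective are all assumptions made for illustration. In the tensor regression setting the objective would be some criterion evaluated at the stacked margins, for instance a negative log-posterior.

```python
import math
import random


def simulated_annealing(objective, start, n_iter=5000, t0=1.0, cooling=0.999,
                        step=1.0, seed=0):
    """Minimise `objective` over a list of floats with a geometric cooling schedule."""
    rng = random.Random(seed)
    current, f_current = list(start), objective(start)
    best, f_best = list(current), f_current
    temperature = t0
    for _ in range(n_iter):
        # Proposal spread shrinks with the temperature: wide moves early, local moves later.
        scale = step * math.sqrt(temperature)
        proposal = [v + rng.gauss(0.0, scale) for v in current]
        f_proposal = objective(proposal)
        # Metropolis-type rule: always accept improvements, sometimes accept worse points.
        if (f_proposal <= f_current
                or rng.random() < math.exp((f_current - f_proposal) / temperature)):
            current, f_current = proposal, f_proposal
            if f_current < f_best:
                best, f_best = list(current), f_current
        temperature *= cooling
    return best, f_best


def squared_distance(point, target=(1.0, -2.0, 0.5)):
    """Toy objective (hypothetical): squared distance from a known optimum."""
    return sum((p - t) ** 2 for p, t in zip(point, target))


start = [0.0, 0.0, 0.0]
best, f_best = simulated_annealing(squared_distance, start)
```

The returned `best` point would then serve as the starting value of the Gibbs sampler; the routine keeps track of the best point visited, so the output is never worse than the starting value.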
We performed a simulation study by drawing a sample of $T = 100$ pairs
$\{Y_t, X_t\}_t$ of square matrices of dimension $k = 10$. The regressor is built from
entry-wise independent AR(1) processes:


$$x_{ij,t} - \mu = \alpha_{ij}(x_{ij,t-1} - \mu) + \eta_{ij,t}, \qquad \eta_{ij,t} \sim \mathcal{N}(0,1)$$
$$y_{ij,t} = 10 + \beta_{ij} x_{ij,t} + \varepsilon_{ij,t}, \qquad \varepsilon_{ij,t} \sim \mathcal{N}(0,1) \qquad (12)$$


where $E[\eta_{ij,t} \eta_{kl,v}] = 0$, $E[\varepsilon_{ij,t} \varepsilon_{kl,v}] = 0$ and $E[\eta_{ij,t} \varepsilon_{kl,v}] = 0$, $\forall (i,j) \neq (k,l)$ and $\forall t \neq v$;
moreover $E[\eta_{ij,t}] = 0$. In addition, the coefficients are drawn from $\alpha_{ij} \sim \mathcal{U}(-10, 10)$
and $\beta_{ij} \sim \mathcal{U}(-1, 1)$.

We demeaned the simulated data, then we initialised the margins of the tensor
$\mathcal{B}$ by simulated annealing and ran the Gibbs sampler for $N = 10000$ iterations.
As an indicator of the goodness of fit of the estimated parameters, we computed
the Frobenius norm of the difference between the original tensor and the one
reconstructed from the posterior of the margins. The outcome is shown in Fig. 1: the
trace plot is shown in blue, while the red curve is the progressive mean across iterations.
In order to reduce the autocorrelation of the posterior sample, we performed thinning by
keeping one observation every 50; the final autocorrelation function after this step
is plotted in Fig. 2.
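The three diagnostics used here (Frobenius distance, progressive mean and thinning) are simple to compute. The following sketch shows one possible implementation on nested lists; the function names are hypothetical, and keeping the first draw of every block of 50 is an assumption, since the text does not say which draw of each block is retained.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of the difference of two order-3 tensors given as nested lists."""
    return math.sqrt(sum((x - y) ** 2
                         for pa, pb in zip(a, b)
                         for ra, rb in zip(pa, pb)
                         for x, y in zip(ra, rb)))


def progressive_mean(chain):
    """Running average of a scalar chain, one value per iteration."""
    means, total = [], 0.0
    for n, value in enumerate(chain, start=1):
        total += value
        means.append(total / n)
    return means


def thin(chain, every=50):
    """Keep one draw out of `every`, starting from the first one."""
    return chain[::every]


# A chain of N = 10000 draws thinned by 50 leaves 200 retained values.
retained = thin(list(range(10000)), every=50)
```

In a trace plot such as the one described above, the chain would be the sequence of Frobenius distances between the true tensor and the tensor rebuilt from the margins at each iteration.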


Fig. 1 Trace plot (blue) and progressive mean (red) of $\|\mathcal{B}^{true} - \mathcal{B}^{post}\|$.


Fig. 2 ACF of the tensor reconstructed from $\beta_1, \beta_2, \beta_3$ after thinning, by keeping one simulated value every 50.


We also report the results for the parameter $\tau$, which drives the global component
of the shrinkage, in Fig. 3: the trace plot and progressive mean indicate the
convergence of the algorithm; however, even in this case autocorrelation between
iterations is present, though lower than for the margins. This is due to the fact
that $\beta_1, \beta_2, \beta_3$ are sampled individually, approximating their joint distribution
via the full conditionals of each single $\beta_j$. For the sake of comparison with
the previous graph, in Fig. 4 we report the autocorrelation function of the posterior
sample of $\tau$ thinned by taking one value every 50. Further details on the results of
the simulation are reported in [1].


Fig. 3 Trace plot (blue) and progressive mean (red) of $\tau$.


Fig. 4 ACF of $\tau$ after thinning by keeping one simulated value every 50.


We are currently working on simulations for models of bigger size, as well as on
applying this methodology to the study of the temporal dynamics of real networks.


5 Conclusions 


We propose a linear regression model for matrices which is a generalisation of
standard econometric models and allows each entry of the covariate to exert a different
effect on each entry of the response. The model is a reduced form of a general tensor
regression; nonetheless, all the analytical and computational results discussed are
directly applicable to the more general form. In particular, the delicate issue of the
initialisation of the sampler has been addressed by means of an efficient
implementation of simulated annealing. The model has been tested on a synthetic dataset,
showing good performance in the reconstruction of the true tensor of coefficients.
We plan to apply the methodology to real network datasets in order to study their
temporal dynamics.


Bayesian nonparametric sparse Vector
Autoregressive models

Multivariate autoregressive models: a Bayesian
nonparametric approach with sparsity


Monica Billio and Roberto Casarin and Luca Rossini 


Abstract Seemingly unrelated regression (SUR) models are useful in studying the
interactions among economic variables. In a high dimensional setting, these models
require a large number of parameters to be estimated and suffer from inferential
problems. To avoid overparametrization issues, we propose a hierarchical Dirichlet
process prior (DPP) for SUR models, which allows shrinkage of coefficients toward
multiple locations. We propose a two-stage hierarchical prior distribution, where the
first stage of the hierarchy consists of a conditionally independent lasso prior of the
Normal-Gamma family for the coefficients. The second stage is given by a random
mixture distribution, which allows for parameter parsimony through two components:
the first is a random Dirac point-mass distribution, which induces sparsity in
the coefficients; the second is a DPP, which allows for clustering of the coefficients.


Abstract SUR regression models are useful for studying the interactions among
economic variables of interest. When working in high dimensions, these
models require the estimation of a large number of variables and suffer from
inferential problems. To avoid overparametrization problems, we propose a
hierarchical Dirichlet-type prior (DPP) for the SUR model, which allows
shrinkage of the coefficients toward several locations. We propose a two-stage
hierarchical prior, where the first stage consists of a lasso prior for the
coefficients from the Normal-Gamma family. The second stage uses a mixture
of random distributions with two components: the first is a Dirac point-mass
distribution, which induces sparsity in the coefficients; the second is based
on a DPP, which allows clustering of the coefficients.


Key words: Bayesian nonparametrics, Bayesian model selection, Shrinkage, Large
vector autoregression.


1 Introduction


In the last decade, high dimensional models and large datasets have increased their
importance in economics (e.g., see [8]). The use of large datasets has been shown
to improve the forecasts in large macroeconomic and financial models (see [1],
[3], [5], [9]). To analyze and better forecast them, SUR/VAR models have
been introduced [11, 12], where the error terms are independent across time, but
may have cross-equation contemporaneous correlations. SUR/VAR models require
the estimation of a large number of parameters with few observations. In order to avoid
overparametrization, overfitting and dimensionality issues, Bayesian inference and
suitable classes of prior distributions have been proposed.

In this paper, a novel Bayesian nonparametric hierarchical prior for multivariate
time series is proposed, which allows shrinkage of the SUR/VAR coefficients
to multiple locations using a Normal-Gamma distribution with unknown location,
scale and shape parameters. In our sparse SUR/VAR (sSUR/sVAR), some
SUR/VAR coefficients shrink to zero, due to the shrinking properties of the lasso-type
distribution at the first stage of our hierarchical prior, thus improving the efficiency
of parameter estimation, prediction accuracy and interpretation of the temporal
dependence structure in the time series. We use a Bayesian Lasso prior, which allows
us to reformulate the SUR/VAR model as a penalized regression problem, in order
to determine which SUR/VAR coefficients shrink to zero (see [10] and [7]).

As regards the second stage of the hierarchy, we use a random mixture distribution
of the Normal-Gamma hyperparameters, which allows for parameter parsimony
through two components. The first component is a random Dirac point-mass
distribution, which induces shrinkage for SUR coefficients; the second component
is a Dirichlet process hyperprior, which allows for clustering of the SUR/VAR
coefficients.

The structure of the paper is as follows. Section 2 introduces the vector au- 
toregressive model. Section 3 describes briefly the Bayesian nonparametric sparse 
model. Section 4 presents some simulation results for different dimensions. Section 
5 concludes. 


2 The Vector Autoregressive model 


Let $y_t = (y_{1,t}, \dots, y_{m,t})' \in \mathbb{R}^m$ be a vector-valued time series. We consider a VAR
model of order $p$ (VAR($p$)):


$$y_t = b + \sum_{i=1}^{p} B_i y_{t-i} + \varepsilon_t, \qquad (1)$$


for $t = 1, \dots, T$, where $b = (b_1, \dots, b_m)'$ and $B_i$ is an $(m \times m)$
matrix of coefficients. We assume that $\varepsilon_t = (\varepsilon_{1,t}, \dots, \varepsilon_{m,t})'$ follows an independent and
identically distributed Gaussian distribution $\mathcal{N}_m(0, \Sigma)$ with mean 0 and covariance
matrix $\Sigma$.

The VAR($p$) in (1) can be rewritten in a stacked regression form:


$$y_t = (I_m \otimes x_t')\beta + \varepsilon_t, \qquad (2)$$

where $x_t = (1, y_{t-1}', \dots, y_{t-p}')'$ is the vector of predetermined variables, $\beta = \mathrm{vec}(B')$,
where $B = (b, B_1, \dots, B_p)$, $\otimes$ is the Kronecker product and vec is the column-wise
vectorization operator that stacks the columns of a matrix in a column vector.
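The equivalence between the VAR recursion and the stacked regression form can be verified numerically on a toy example. The sketch below uses a VAR(1) with m = 2 and hypothetical coefficient values, and assumes that the coefficient vector stacks the rows of B = (b, B_1), that is, the columns of its transpose; with this ordering the product with the Kronecker matrix reproduces b + B_1 y_{t-1}.

```python
def kron_identity_row(x, m):
    """Return I_m (Kronecker) x' as an m x (m * len(x)) nested list."""
    q = len(x)
    return [[x[c - i * q] if i * q <= c < (i + 1) * q else 0.0
             for c in range(m * q)] for i in range(m)]


def matvec(a, v):
    """Matrix-vector product for nested lists."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


# Toy VAR(1) with m = 2 (hypothetical numbers): B = (b, B_1) is 2 x 3
# and x_t = (1, y_{t-1}')'.
b = [0.5, -1.0]
B1 = [[0.2, 0.1],
      [0.0, 0.3]]
y_prev = [1.0, 2.0]

B = [[b[i]] + B1[i] for i in range(2)]
x_t = [1.0] + y_prev

direct = matvec(B, x_t)                              # b + B_1 y_{t-1}
beta = [entry for row in B for entry in row]         # rows of B stacked
stacked = matvec(kron_identity_row(x_t, 2), beta)    # (I_m kron x_t') beta
```

Both `direct` and `stacked` give the conditional mean of y_t, which is the point of the stacked form: every equation of the VAR shares the same regressor vector x_t.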


3 Bayesian nonparametric sparse VAR 


In this paper we define a hierarchical prior distribution which induces sparsity on
the vector of coefficients $\beta$. In order to regularize (2) we incorporate a penalty using
a lasso prior $f(\beta) = \prod_{j=1}^{r} \mathcal{NG}(\beta_j \mid 0, \gamma, \tau)$, where $\mathcal{NG}(\beta \mid \mu, \gamma, \tau)$ denotes the
normal-gamma distribution with location parameter $\mu$, shape parameter $\gamma > 0$ and
scale parameter $\tau > 0$. The normal-gamma distribution induces shrinkage toward
the prior mean $\mu$, but we can extend the lasso model specification by introducing
a mixture prior with separate location parameter $\mu_j^*$, separate shape parameter $\gamma_j^*$
and separate scale parameter $\tau_j^*$ such that $f(\beta) = \prod_{j=1}^{r} \mathcal{NG}(\beta_j \mid \mu_j^*, \gamma_j^*, \tau_j^*)$. In our
paper, we favor the sparsity of the parameters through the use of a carefully tailored
hyperprior and we use a nonparametric Dirichlet process prior (DPP), which reduces
the overfitting problem and the curse of dimensionality by allowing for parameter
clustering due to the concentration parameter and the base measure choice.

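In our case we define $\theta^* = (\mu^*, \gamma^*, \tau^*)$ as the parameters of the Normal-Gamma
distribution, and assume a prior $Q_l$ for $\theta_{lj}^*$, that is


$$\beta_{lj} \overset{ind}{\sim} \mathcal{NG}(\beta_{lj} \mid \mu_{lj}^*, \gamma_{lj}^*, \tau_{lj}^*), \qquad (3)$$
$$\theta_{lj}^* \mid Q_l \overset{iid}{\sim} Q_l, \qquad (4)$$


for $j = 1, \dots, r_l$ and $l = 1, \dots, N$.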
Following a construction of the hierarchical prior similar to the one proposed in 
[4] we define the vector of random measures 


$$Q_1(d\theta_1) = \pi_1 P_0(d\theta_1) + (1 - \pi_1) P_1(d\theta_1),$$
$$\vdots \qquad (5)$$
$$Q_N(d\theta_N) = \pi_N P_0(d\theta_N) + (1 - \pi_N) P_N(d\theta_N),$$


with the same sparse component $P_0$ in each equation and with the following
hierarchical construction, as previously explained:


$$P_0(d\theta) \sim \delta_{\{(0, \gamma_0, \tau_0)\}}(d(\mu, \gamma, \tau)),$$
$$P_l(d\theta) \overset{iid}{\sim} \mathcal{DP}(\tilde{\alpha}, G_0), \quad l = 1, \dots, N, \qquad (6)$$
$$\pi_l \overset{iid}{\sim} \mathcal{B}e(\pi_l \mid 1, \alpha_l), \quad l = 1, \dots, N,$$
$$(\gamma_0, \tau_0) \sim g(\gamma_0, \tau_0 \mid \nu_0, p_0, s_0, n_0),$$
$$G_0 \sim \mathcal{N}(\mu \mid c, d) \times g(\gamma, \tau \mid \nu_1, p_1, s_1, n_1),$$


where $\delta_{\{\psi_0\}}(\psi)$ denotes the Dirac measure indicating that the random vector $\psi$ has
a degenerate distribution with mass at the location $\psi_0$, and $g(\gamma_0, \tau_0)$ is the conjugate
joint prior distribution (see [6]). We apply the Gibbs sampler and the hyperparameters
given in [2] for the posterior approximation.
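The mixture of a point mass and a Dirichlet process can be made tangible by simulating from a simplified version of it. The sketch below handles one equation and only the location parameter: it draws the weight of the sparse component from a Beta distribution, builds a truncated stick-breaking approximation of the Dirichlet process with a Gaussian base measure, and then assigns each coefficient either to the spike at zero or to one of the shared atoms. The truncation level, the hyperparameter values and the restriction to the location parameter are assumptions made for illustration; this is not the sampler of [2].

```python
import random


def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking weights of a Dirichlet process with concentration alpha."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)            # last atom absorbs the leftover mass
    return weights


def draw_locations(n_coef, alpha=1.0, alpha_pi=1.0, n_atoms=50, c=0.0, sd=1.0, seed=0):
    """Draw locations mu*_j from Q = pi * delta_0 + (1 - pi) * P, with P ~ DP(alpha, G0)."""
    rng = random.Random(seed)
    pi = rng.betavariate(1.0, alpha_pi)                        # weight of the sparse component
    weights = stick_breaking(alpha, n_atoms, rng)
    atoms = [rng.gauss(c, sd) for _ in range(n_atoms)]         # atoms from a Gaussian G0 (assumed)
    draws = []
    for _ in range(n_coef):
        if rng.random() < pi:
            draws.append(0.0)                                  # spike: location fixed at zero
        else:
            # Atoms are shared across coefficients, which produces clustering.
            draws.append(rng.choices(atoms, weights=weights)[0])
    return pi, weights, draws


pi, weights, mu = draw_locations(n_coef=20)
```

Coefficients that receive the same atom belong to the same cluster, while those assigned to the spike are shrunk toward zero, which mirrors the two roles of the components described above.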


4 Simulation Results 


The nonparametric prior presented in Section 3 allows for shrinking the SUR co- 
efficients. In order to assess the goodness of the prior we performed a simulation 
study of our Bayesian nonparametric sparse model. We consider different datasets 
with sample size T = 100 from the VAR model of order 1: 


$$y_t = B y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim} \mathcal{N}_m(0, \Sigma), \qquad t = 1, \dots, 100,$$


where the dimension of y, and of the square matrix of coefficients B can take dif- 
ferent values: m = 20 (small dimension), m = 40 (medium dimension) and m = 80 
(large dimension). Furthermore, we choose different settings of the matrix B, focus- 
ing on a block-diagonal structure with random entries of the blocks: 


• the block-diagonal matrix $B = \mathrm{diag}\{B_1, \dots, B_{m/4}\} \in \mathcal{M}_{(m,m)}$ is generated with
blocks $B_j$ ($j = 1, \dots, m/4$) of $(4 \times 4)$ matrices on the main diagonal:


$$B_j = \begin{pmatrix} b_{11,j} & \dots & b_{14,j} \\ \vdots & \ddots & \vdots \\ b_{41,j} & \dots & b_{44,j} \end{pmatrix},$$


where the elements are randomly taken from a uniform distribution $\mathcal{U}(-1.4, 1.4)$
and then checked for the weak stationarity condition of the VAR;

• the random matrix $B$ is an $(80 \times 80)$ matrix with 150 elements randomly chosen
from a uniform distribution $\mathcal{U}(-1.4, 1.4)$ and then checked for the weak
stationarity condition of the VAR.


Figure 1 shows the posterior mean of $\Delta$, which gives the allocation of the
coefficients between the two random measures $P_0$ and $P_l$. In particular,
white indicates that the coefficient $\delta_{ij}$ is equal to zero (i.e. sparse component),
while black indicates that $\delta_{ij}$ is equal to one (nonsparse component).
The definition of the pairwise posterior probabilities and of the co-clustering matrix
for the atom locations $\mu$ allows us to build the weighted networks (see Figure 2),
where the blue edges represent negative weights, while the red ones represent
positive weights. In each coloured graph the nodes represent the $m$ variables of the
VAR model, and a clockwise-oriented edge between two nodes $i$ and $j$ represents a
non-null coefficient for the variable $y_{j,t-1}$ in the $i$-th equation of the VAR.


Fig. 1 Posterior mean of the matrix of $\delta$ for (a) $m = 40$ (left) and (b) $m = 80$ with random
elements (right).


Fig. 2 Weighted network for (a) $m = 40$ (left) and (b) $m = 80$ with random elements (right), where
the blue edges represent negative weights and the red ones represent positive weights.


5 Conclusions


This paper proposes a novel Bayesian nonparametric prior for SUR models, which
allows for shrinking SUR coefficients toward multiple locations and for identifying
groups of coefficients. We introduce a two-stage hierarchical distribution, which
consists of a hierarchical Dirichlet process on the parameters of the Normal-Gamma
distribution. The proposed hierarchical prior is used to build a Bayesian
nonparametric model for SUR models. We provide an efficient Markov chain Monte
Carlo algorithm for the posterior computations, and the effectiveness of this
algorithm is assessed in simulation exercises. The simulation studies illustrate the good
performance of our model for different dimensions of $B$ and $y_t$.


Using GPS Data to Understand Urban Mobility
Patterns: An Application to the Florence
Metropolitan Area


Analysing urban mobility behaviour through
GPS data: an application to the Florence metropolitan
area


Chiara Bocci, Daniele Fadda, Lorenzo Gabrielli, Mirco Nanni, Leonardo Piccini


Abstract Big Data, originating from the digital breadcrumbs of human activities,
let us observe the individual and collective behaviour of people at an unprecedented
level of detail. In this paper we investigate the informative potential that the digital
tracking offered by GPS-enabled devices holds for academic research and policy makers,
with specific attention to urban and metropolitan settings. The unstructured nature of
the dataset requires careful consideration and correction of possible biases which
could lead to unreliable results. We use the 2011 census commuting matrix as a
validation tool for our proposed methodology. GPS data contain information that would
not otherwise be available, such as non-systematic mobility patterns. The resulting
estimates are then used to analyse mobility patterns within the Florence Metropolitan
Area in a more exhaustive and detailed form.

Abstract Technological evolution has led, in recent years, to a remarkable
increase in devices able to produce and store digital traces of our daily
behaviour. In this work we investigate the informative potential contained in the
traces produced by GPS-equipped devices for research or policy-planning
purposes, with particular reference to the urban and metropolitan context. The
spontaneous and unstructured nature of the data requires particular attention to
possible sources of bias. We use the 2011 census commuting matrix as a tool for
validating the results. GPS data contain information that is otherwise hard to
obtain, such as non-systematic mobility behaviour. The estimates obtained
are used for a case study focused on the Florence Metropolitan Area.


Key words: big data, mobility, urban planning, O-D matrix validation


1 Introduction


Technological evolution has brought, in recent years, a remarkable increase in
the diffusion of devices that can record digital footprints of our behaviour on a
daily basis, tracking a wide range of activities. The constant and largely unintentional
production of such tracks generates huge datasets that contain a valuable amount
of information about socio-economic behaviour, which may be extracted and used for
socio-economic research and for policy analysis [1].

Big data sources may support policy makers in the ex-ante phase of policy im- 
plementation, by providing a more sophisticated depiction of the socio-economic 
environment and may be used for ex-post evaluation purposes in quasi-experimental 
design and counterfactual settings. 

The literature on the matter and practical experience have highlighted pros and cons
of this approach [5]. Some of the pros include timeliness, cost effectiveness, spatial
and temporal disaggregation, and the emergence of unexpected and/or unobservable
phenomena. On the other hand, given the relative novelty of the methodologies used
to deal with these data, extra care is needed to acknowledge possible
shortcomings in terms of quality, accessibility, applicability, relevance, privacy
policy and ownership of the data, all of which may affect the quality of policy
evaluation and appraisal. Nonetheless, we believe that big data sources can be successfully
used to strengthen the capabilities of public institutions to deal with complex
problems, to plan effective policies and to evaluate the outcomes of their actions. To this
end, we propose a methodology that allows us to use data collected from GPS-enabled
devices, installed on private vehicles for insurance purposes, to analyse and
understand mobility patterns within an urban setting.


2 Research statement, objectives and data sources 


The aim of the paper is to find a viable method to use GPS data to produce a 
non-biased Origin-Destination matrix for the selected study area, i.e. the Florence 
Metropolitan Area. Since the GPS dataset is derived from private car mobility, our 
focus will be mainly on this type of flows. However, since we want to use our es- 
timates to assess the intensity and characteristics of the relations between different 
geographic zones within the metropolitan area, we need to find a way to correct
for the different propensity to use public transport which we expect to observe
across the different Origin-Destination pairs.

Typically, this kind of data is collected systematically every 10 years, during
the nationwide official census. However, census data, while very rich in information
and details, have two major drawbacks: the temporal lag between censuses, during
which we have no information on mobility, and the focus on what we call systematic
mobility, i.e. the mobility which happens almost every day and is mainly related to
home-to-school or home-to-work trips, leaving out an increasingly relevant segment
of non-systematic mobility, which, by its very nature, is difficult to capture with
traditional methods. If our methodology is correct, we can thus increase our analytical
capability with an informative base that can be updated almost continuously and
that includes all mobility and not only the systematic one.

For this study we use GPS data provided by a leading insurance telematics company
that monitors about 2% of all vehicles circulating in Italy. Our dataset covers
about 150k private vehicles crossing Tuscany in the month of June 2011, and
represents a primary source of information for studying mobility behaviour. Data on
the vehicle fleet in Italy provided by the Italian Automobile Club (ACI), Census
data provided by ISTAT, and trip durations and distances for different means of
transport computed using Google services are used to re-scale the vehicle sample to
the real mobility flows. Once we validate our data and estimate a reliable O-D
matrix, we can use the data to carry out an extensive descriptive analysis of
mobility patterns in our selected geographic area. To demonstrate the informative
potential of this kind of data, we choose the Florence Metropolitan Area
as a case study.


3 Estimating a detailed O-D Matrix using GPS data 


As previously discussed, GPS data contain an inherent bias: they account only
for private car usage (specifically, for the fraction of vehicles that have a GPS
device installed for insurance purposes and that are monitored by our provider).

Since we want to use GPS data for socio-economic analysis and for policy planning
and evaluation, we need to scale the flows that we observe up to our real
population, which means accounting for (at least) three missing dimensions:


1. We observe vehicles, but we want to estimate the number of people actually
travelling, which means accounting for average car occupancy;

2. We observe a fraction of vehicles that is geographically heterogeneous, so we
want to account for the different market penetration of our provider;

3. We observe only private cars, so we want to account for a heterogeneous share
of public transport users.


In order to estimate a complete O-D Matrix, we use the 2011 Census Origin-Destination
Matrix as a validation tool. Such matrices are usually released with a
territorial detail that corresponds to the administrative units of municipalities. They
contain information on municipality of origin, municipality of destination,
time of departure, duration of the trip, means of transport, gender and purpose of
the trip (work- or school-related). A geographically more detailed matrix is also
released by ISTAT, disaggregated to the census zones, but with less information
on the characteristics of the trip (only the purpose). Since we want to
estimate an O-D Matrix suitable for analysing urban areas, we want a sub-municipality
disaggregation for larger municipalities. We therefore use the more detailed matrix
for our validation. Since we also need at least the means of transport for our
validation, we split our flows using the share of public transport registered
between the corresponding municipalities.

Our starting dataset comprises systematic trips observed over the month of
June 2011 and aggregated by the 2011 Census zone partitions. If we assume
our data to be a random sample drawn from the population of all car movements
happening within the borders of Tuscany during our time frame, we can estimate our
target values (a census zone O-D matrix of people using all available means of
transport) with the following formula


$$\mathrm{XFlow}_{ij} = \mathrm{flow}_{ij} \cdot \mathrm{car.pen}_{i} \cdot \mathrm{avg.occ}_{i} \cdot \mathrm{public.t.ind}_{ij}$$


where $\mathrm{XFlow}_{ij}$ is our desired estimate from zone i to zone j,
$\mathrm{flow}_{ij}$ is our observed flow from zone i to zone j,
$\mathrm{car.pen}_{i}$ is the market penetration of our data provider
for the municipality within which zone i falls, $\mathrm{avg.occ}_{i}$ is the
average occupancy rate for systematic mobility departing from the municipality
within which zone i falls (derived from census data), and
$\mathrm{public.t.ind}_{ij}$ is a public transport accessibility index calculated
between zone i and zone j (using Google services).
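
The rescaling formula above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the dictionary layout and all numbers are hypothetical, and the sketch assumes that the penetration term enters as a multiplicative expansion factor, exactly as the formula is written.

```python
def scale_od_flows(flow, car_pen, avg_occ, public_t_ind, zone_to_muni):
    """Rescale observed flows: XFlow_ij = flow_ij * car.pen_i * avg.occ_i * public.t.ind_ij."""
    x_flow = {}
    for (i, j), observed in flow.items():
        # car.pen and avg.occ are looked up through the municipality of the origin zone i
        muni = zone_to_muni[i]
        x_flow[(i, j)] = observed * car_pen[muni] * avg_occ[muni] * public_t_ind[(i, j)]
    return x_flow


# Hypothetical toy inputs, not taken from the study.
flow = {("z1", "z2"): 10, ("z2", "z1"): 4}               # observed GPS trips
zone_to_muni = {"z1": "A", "z2": "B"}                    # census zone -> municipality
car_pen = {"A": 50.0, "B": 40.0}                         # assumed expansion factors
avg_occ = {"A": 1.2, "B": 1.5}                           # people per car
public_t_ind = {("z1", "z2"): 1.25, ("z2", "z1"): 1.0}   # public transport correction

x_flow = scale_od_flows(flow, car_pen, avg_occ, public_t_ind, zone_to_muni)
print(x_flow)
```

Keeping the municipality-level factors separate from the zone-pair index mirrors the indices in the formula: two factors depend only on the origin, one on the O-D pair.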


4 Validating GPS data using the Census O-D Matrix 


Once we estimate our O-D Matrix, we want to check how our estimates perform
against our reference values, i.e. the 2011 census O-D Matrix. The literature on
matrix comparison has produced different indicators that assess how similar two
matrices are (see [2] for a detailed presentation and discussion of these indicators).
Moreover, one recent thread of research has been trying to evaluate the similarity
of O-D matrices by using image quality assessment techniques borrowed from image
processing [6, 7]. We test the performance of our estimated matrix by applying
different measures: some classical statistical indicators (such as the $R^2$
association measure, the Root Mean Square Error (RMSE) and the Pearson $\chi^2$
test); the Geoffrey E. Havers (GEH) statistic, which evaluates the closeness of each
pair of corresponding cells of the two matrices; and the recently proposed (and
still under study) Mean Structural Similarity Index (MSSIM) by Van Vuren and
Day-Pollard [6], which compares two O-D matrices considering the means, variances
and covariance of contiguous matrix cells evaluated within a moving block of cells
in each matrix.
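
As an illustration of two of the indicators mentioned above, the following sketch computes the RMSE and the cell-wise GEH statistic for a pair of small matrices. It uses the GEH statistic in its usual traffic-modelling form, sqrt(2(m - c)^2 / (m + c)); the threshold of 5 is a common rule of thumb and is an assumption here, not a value taken from the paper. Matrices and function names are hypothetical.

```python
import math


def geh(m, c):
    """GEH statistic for one cell: sqrt(2 * (m - c)^2 / (m + c))."""
    if m + c == 0:
        return 0.0  # both cells empty: treat as a perfect match
    return math.sqrt(2.0 * (m - c) ** 2 / (m + c))


def rmse(est, ref):
    """Root mean square error over all cells of two equally sized matrices."""
    n_cells = sum(len(row) for row in ref)
    squared = sum((e - r) ** 2 for e_row, r_row in zip(est, ref)
                  for e, r in zip(e_row, r_row))
    return math.sqrt(squared / n_cells)


def share_geh_below(est, ref, threshold=5.0):
    """Fraction of cells whose GEH value is below the (assumed) threshold."""
    values = [geh(e, r) for e_row, r_row in zip(est, ref)
              for e, r in zip(e_row, r_row)]
    return sum(v < threshold for v in values) / len(values)


# Hypothetical 2x2 matrices: estimated flows versus census reference.
estimated = [[100, 50], [30, 0]]
reference = [[110, 50], [20, 0]]

print(rmse(estimated, reference))
print(share_geh_below(estimated, reference))
```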


5 Using the estimated Matrix for socio-economic analysis 


Once we have validated the estimation, we can use our matrix to produce a variety 
of indicators that can help policy makers understand the connection and mobility 
patterns which operate within their territories. The case study area that we selected 
is the Florence Metropolitan Area. 
5.1 Filling the gaps and assessing mobility patterns


We can use our methodology to estimate an O-D Matrix for subsequent years and
compare the results as a time series. Moreover, since we can unpack our matrix in
a spatially detailed manner, we can assess mobility patterns within the municipality
boundaries. As an example, Figure 1 shows the average speed of the
observed trajectories as it varies hourly within the day and across different
partitions of the city (in this case, the 5 administrative neighbourhoods of the
city of Florence).


[Figure 1. Panel title: hourly distribution of outgoing trajectories; x-axis: hour of the day]


Fig. 1 Average speed by hour and zone of departure


5.2 The boundaries of the city 


Generally, the borders of a city are measured by looking at census data alone, i.e.
population density in absolute terms or its variation over time [4, 3]. We propose a
clustering approach aimed at partitioning territories on the basis of human
movements inferred from Big Data.

The aim of our work is to contribute to this debate by providing a tool that lets
policy makers build a novel definition of regions, seen as functional areas. Focusing
on the Metropolitan Area of Florence, we aggregate territories so as to maximise
internal traffic and minimise external traffic.

Given two generic nodes a and b, we define internal traffic as the sum of the flows
from a to b and vice versa. For each pair a, b we compute the corresponding entry of
the distance matrix from the percentage of internal flow with respect to the total
flows. Since the clustering method seeks the partition that minimises the distances
contained in the input matrix, each entry of the distance matrix is calculated as
d = 1 - %internal flows.

We provide the distance matrix as input to DBSCAN, evaluate the possible
values of epsilon, and extract the territorial partition shown in Figure 2.
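
The construction of the distance matrix can be sketched as follows. The text does not spell out how the "total flows" of a pair are counted, so this sketch assumes that they are all flows with at least one endpoint in the pair; that choice, the function name and the toy flows are all hypothetical.

```python
def internal_flow_distance(flows, zones):
    """Return d[(a, b)] = 1 - share of internal flow, for every pair of distinct zones."""
    distance = {}
    for idx, a in enumerate(zones):
        for b in zones[idx + 1:]:
            # Internal traffic: flow from a to b plus flow from b to a.
            internal = flows.get((a, b), 0) + flows.get((b, a), 0)
            # Assumption: total flows = every flow touching a or b.
            total = sum(v for (i, j), v in flows.items() if i in (a, b) or j in (a, b))
            share = internal / total if total else 0.0
            distance[(a, b)] = distance[(b, a)] = 1.0 - share
    return distance


# Hypothetical flows between three zones.
flows = {("A", "B"): 30, ("B", "A"): 20, ("A", "C"): 10, ("C", "B"): 40}
dist = internal_flow_distance(flows, ["A", "B", "C"])
print(dist)
```

A symmetric matrix of this kind can then be handed to any DBSCAN implementation that accepts precomputed distances (for instance scikit-learn's `DBSCAN` with `metric="precomputed"`), trying several values of epsilon.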
Fig. 2 Boundaries of Florence Metropolitan Area using GPS data


6 Conclusions and future research 


The proposed methodology allows us to reliably use GPS data for the analysis of
urban mobility behaviour, without relying on the snapshot provided every ten years
by the national census. The informative potential of this source is very high, and
the source is flexible. Future lines of research include extending the methodology
to validate non-systematic data and further validation using GSM data (call records
from mobile phone usage).
Relative privacy risks and learning from
anonymized data 


Privacy and learning in anonymized data


Michele Boreale and Fabio Corradi 


Abstract We consider group-based anonymized tables, a popular approach to data
publishing. This approach aims at protecting the privacy of the individuals involved
by releasing an obfuscated version of the original data, where the exact
correspondence between individuals and attribute values is hidden. When publishing
data about individuals, one must typically balance the learner's utility against the
risk posed by an attacker potentially targeting individuals in the dataset.
Accordingly, we propose an MCMC-based methodology to learn the population parameters
from a given anonymized table and to analyze the risk that any individual in the
dataset is linked to a specific sensitive value when the attacker knows the
individual's nonsensitive attributes. We call this relative risk analysis. Finally,
we illustrate results obtained by the proposed methodology on a real dataset.

Abstract We consider anonymized tables built to release information about a
population while hiding the attribution of sensitive data to individual respondents.
We assess the information about the population that remains available and the risk
of violating the privacy of the respondents, providing several forms of learning and
evaluation. We report the results of an experiment conducted on a real dataset.


Key words: Privacy, anonymization, k-anonymity, MCMC methods. 
Table 1 A table (left) anonymized according to Local recoding (center) and Anatomy (right).

Original table                     Local recoding                  Anatomy
ID | Nat.     | ZIP   | Dis.       ID | Nat.  | ZIP   | Dis.       GID | Nat.     | ZIP   || Dis.
1  | Malaysia | 45501 | Heart      1  | {M,J} | 4550* | Heart      1   | Japan    | 45502 || Heart
2  | Japan    | 45502 | Flu        2  | {M,J} | 4550* | Flu        1   | Malaysia | 45501 || Flu
3  | Japan    | 55503 | Flu        3  | Japan | 5550* | Flu        2   | Japan    | 55504 || Flu
4  | Japan    | 55504 | Stomach    4  | Japan | 5550* | Stomach    2   | Japan    | 55503 || Stomach
5  | China    | 66601 | HIV        5  | {C,J} | 66601 | HIV        3   | Japan    | 66601 || HIV
6  | Japan    | 66601 | Diabetes   6  | {C,J} | 66601 | Diabetes   3   | China    | 66601 || Diabetes
7  | India    | 77701 | Flu        7  | {I,M} | 77701 | Flu        4   | Malaysia | 77701 || Flu
8  | Malaysia | 77701 | Heart      8  | {I,M} | 77701 | Heart      4   | India    | 77701 || Heart


1 Introduction 


It is common practice to release datasets involving individuals in some
anonymized form. The goal is to enable the computation of population characteristics
with reasonable accuracy, while preventing leakage of sensitive
information about individuals in the dataset. We are interested in group-based
techniques, put forward in Computer Science in the last 15 years or so: k-anonymity
[5] and its variants, like ℓ-diversity [2], and Anatomy [8]. Despite their weakness
against attackers with strong background knowledge, these techniques are a common
choice when it comes to table publishing [3]. In group-based methods, the
anonymized or obfuscated version of a table is obtained by partitioning records into
groups enjoying certain properties (see Section 2). Generally speaking, even knowing
that an individual belongs to a group of the anonymized table, an attacker will not
be able to link that individual to a specific sensitive value in the
group. Two examples of group-based anonymization are shown in Table 1, adapted from
[7]. The original table collects medical data from eight individuals; here Disease
is considered as the only sensitive attribute. The central table is a 2-anonymous
table obtained by local recoding: within each group, the nonsensitive attributes are
generalized so as to make them indistinguishable. This is an example of a horizontal
scheme. Generally speaking, each group in a k-anonymous table consists of at least
k records, which are indistinguishable as far as the nonsensitive part is concerned.
The table on the right is an example of an application of Anatomy: within each
group, the nonsensitive parts of the rows are randomly permuted (a vertical scheme),
thus breaking the link between sensitive and nonsensitive values.
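
The k-anonymity property described above (every group contains at least k records with identical generalized nonsensitive values) can be checked mechanically. The following sketch does so for the original and the locally recoded tables of Table 1; the function name and the row encoding are illustrative choices, not part of the paper.

```python
from collections import Counter


def is_k_anonymous(rows, k):
    """rows: list of (nonsensitive_tuple, sensitive_value) pairs."""
    # Records sharing the same (generalized) nonsensitive part form one group.
    group_sizes = Counter(nonsensitive for nonsensitive, _ in rows)
    return all(size >= k for size in group_sizes.values())


original = [
    (("Malaysia", "45501"), "Heart"), (("Japan", "45502"), "Flu"),
    (("Japan", "55503"), "Flu"), (("Japan", "55504"), "Stomach"),
    (("China", "66601"), "HIV"), (("Japan", "66601"), "Diabetes"),
    (("India", "77701"), "Flu"), (("Malaysia", "77701"), "Heart"),
]
local_recoding = [
    (("{M,J}", "4550*"), "Heart"), (("{M,J}", "4550*"), "Flu"),
    (("Japan", "5550*"), "Flu"), (("Japan", "5550*"), "Stomach"),
    (("{C,J}", "66601"), "HIV"), (("{C,J}", "66601"), "Diabetes"),
    (("{I,M}", "77701"), "Flu"), (("{I,M}", "77701"), "Heart"),
]

print(is_k_anonymous(original, 2), is_k_anonymous(local_recoding, 2))
```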

We put forward a probabilistic model to reason about the relative risk posed by
the release of anonymized datasets (Section 2), i.e. the leakage of sensitive
information for an individual in the table, beyond what is implied for the general
population. To see what is at stake here, consider the central table of Table 1. An
adversary may reason that, with the exception of the first group, a Japanese is never
connected to Heart Disease. This hint can become strong evidence in a larger,
real-world table. Suppose now that the attacker's target, a Malaysian living at ZIP
code 4550*, is known to belong to the table, so he must be in the first group. On the
basis of the evidence about Japanese not suffering from Heart Disease, the attacker
can then link his target to Heart Disease with high probability. Here, the attacker
combines knowledge learned from the anonymized table and about his victim with the
group structure of the table itself. To reason formally about this phenomenon, we
will define the relative privacy risks by comparing two conditional probability
distributions, encoding respectively: what can be learned about the population from
the anonymized table; and what can be learned about the victim, given knowledge of
her/his nonsensitive attributes and presence in the table (Section 3). Generalizing
Kifer [1] and Wong et al. [7], we propose an MCMC method to learn both the population
parameters and the attacker's probability distribution from the anonymized data
(Section 4). We finally illustrate the results of an experiment on a real-world
dataset (Section 5).


2 Group based anonymization schemes and the probabilistic 
model 


Given a dataset of N individuals, let $\mathcal{R}$ and $\mathcal{S}$, ranged over by
r and s, be finite nonempty sets of nonsensitive and sensitive values. A row is a
pair $(s,r) \in \mathcal{S} \times \mathcal{R}$.

In a group-based scheme a cleartext table is an arrangement of a multiset of N
rows, say $d = (s_1,r_1),\dots,(s_N,r_N)$, into a sequence of groups,
$t = g_1,\dots,g_k$, where each group is a sequence
$g_j = (s_{j,1},r_{j,1}),\dots,(s_{j,n_j},r_{j,n_j})$. Given a generic group g, its
obfuscation is a pair $g^* = (l,m)$, where $m = s_1,s_2,\dots$ is the sequence of
sensitive values occurring in g, and l, called generalized nonsensitive value, is:


• a superset of g's nonsensitive values, for horizontal schemes (e.g. k-anonymity);
• the multiset of g's nonsensitive values $\{|\, r_1,r_2,\dots \,|\}$, for vertical schemes.


Given a table $t = g_1,\dots,g_k$, an obfuscated table is a $t^* = g_1^*,\dots,g_k^*$
such that each $g_j^*$ is an obfuscation of the corresponding group $g_j$. An
anonymization algorithm A is a (possibly probabilistic) mechanism that maps
collections of N rows, d, into obfuscated tables, $t^*$.

Our model consists of the following random variables with the associated meaning.


• $\Pi$, taking values in the set $\mathcal{D}$ of full support probability
distributions over $\mathcal{S} \times \mathcal{R}$: the (unknown) joint probability
distribution of the population.

• $T = G_1,\dots,G_k$, taking values in the set of tables. Each group $G_j$ is in turn
a sequence of $n_j$ consecutive rows in T,
$G_j = (S_i,R_i),\dots,(S_{i+n_j-1},R_{i+n_j-1})$; the number k of groups is not
fixed, but depends itself on the rows $S_j, R_j$.

• $T^* = G_1^*,\dots,G_k^*$, taking values in the set of obfuscated tables.


We assume that the above three random variables form a Markov chain
$\Pi \to T \to T^*$. In other words, the joint probability density f of these
variables can be factorized as:


$$f(\pi,t,t^*) = f(\pi)\, f(t \mid \pi)\, f(t^* \mid t). \qquad (1)$$
We also assume the following.


• $\pi \in \mathcal{D}$ is encoded as a pair $(\pi_S, \pi_{R|S})$ such that
$f(s,r \mid \pi) = f(s \mid \pi_S)\, f(r \mid \pi_{R|s})$. Here, $\pi_S$ is a
distribution over $\mathcal{S}$, and $\pi_{R|S}$ is viewed as a collection of
distributions over $\mathcal{R}$, $\pi_{R|S} = (\pi_{R|s})_{s \in \mathcal{S}}$. We
posit that $\pi_S$ and the $\pi_{R|s}$ are chosen independently, according to
Dirichlet distributions of hyperparameters
$\alpha = (\alpha_1,\dots,\alpha_{|\mathcal{S}|})$ and
$\beta^s = (\beta^s_1,\dots,\beta^s_{|\mathcal{R}|})$, respectively. In other words


$$f(\pi) = \mathrm{Dir}(\pi_S \mid \alpha) \cdot \prod_{s \in \mathcal{S}} \mathrm{Dir}(\pi_{R|s} \mid \beta^s). \qquad (2)$$


• The N individual rows composing the table t, $(s_1,r_1),\dots,(s_N,r_N)$, are
assumed to be drawn i.i.d. conditionally on $\Pi$. This amounts to positing that:


$$f(t \mid \pi) = f(s_1,r_1 \mid \pi) \cdots f(s_N,r_N \mid \pi). \qquad (3)$$


3 The honest learner, the attacker and measures of relative risk 


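An honest learner is someone who, after observing $T^* = t^*$, updates his knowledge
of the population parameters $\pi$. In addition, an attacker also knows the
nonsensitive value $r_V$ of a victim in T. In what follows we shall fix once and for
all $t^*$ and $r_V$ such that
$f(r_V,t^*) \triangleq f(r_V \text{ occurs in } T,\ T^* = t^*) > 0$. Let
$p_L(s,r \mid t^*)$ be the joint probability distribution on the population that can
be learned from $t^*$. Formally, for each $(s,r)$:


$$p_L(s,r \mid t^*) \triangleq E_{\pi \sim f(\pi \mid t^*)}[f(s,r \mid \pi)] = \int_{\mathcal{D}} f(s,r \mid \pi)\, f(\pi \mid t^*)\, d\pi. \qquad (4)$$


Of course, we can condition $p_L$ on any given r, and in particular on the victim's
nonsensitive attribute $r_V$, and obtain the corresponding distribution on
$\mathcal{S}$:


$$p_L(s \mid r_V,t^*) \triangleq E_{\pi \sim f(\pi \mid r_V,t^*)}[f(s \mid r_V,\pi)] = \int_{\mathcal{D}} f(s \mid r_V,\pi)\, f(\pi \mid r_V,t^*)\, d\pi. \qquad (5)$$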


Given knowledge of $r_V$ and knowledge that the victim is in T, we can define
the attacker's distribution on $\mathcal{S}$ as follows. Let us introduce a random
variable V, identifying the victim as one of the individuals in T. In other words, V
is an index, which we posit is a priori uniformly distributed on $1..N$, and
independent of $\Pi$ and T. Recalling that each row $(S_j,R_j)$ is identified by a
unique index j, we can define the attacker's probability distribution on
$\mathcal{S}$, after seeing $t^*$ and $r_V$, as:


$$p_A(s \mid r_V,t^*) \triangleq f(S_V = s \mid R_V = r_V,\, t^*). \qquad (6)$$

Theorem 1 expresses $p_A(s \mid r_V,t^*)$ solely in terms of the marginals of the
$R_j$ given $t^*$.


Theorem 1. Let $T = (S_j,R_j)_{j \in 1..N}$ and let $s_j$ be the sensitive value in
row j of $t^*$. Then

$$p_A(s \mid r_V,t^*) \propto \sum_{j:\, s_j = s} f(R_j = r_V \mid t^*). \qquad (7)$$
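
To make the statement of Theorem 1 concrete, the sketch below turns per-row marginals into the attacker's distribution: for each sensitive value s, it sums the marginal probabilities that row j carries the victim's nonsensitive value over the rows whose sensitive value is s, then normalizes. The marginals used here are hypothetical numbers; in practice they would have to be estimated, for example from MCMC samples.

```python
def attacker_distribution(sensitive_by_row, marginal_rv_by_row):
    """sensitive_by_row[j] = s_j; marginal_rv_by_row[j] = f(R_j = r_V | t*)."""
    weights = {}
    for s_j, p_j in zip(sensitive_by_row, marginal_rv_by_row):
        # Accumulate the unnormalized weight of each sensitive value.
        weights[s_j] = weights.get(s_j, 0.0) + p_j
    total = sum(weights.values())
    if total == 0:
        raise ValueError("r_V has zero probability of occurring in the table")
    return {s: w / total for s, w in weights.items()}


# Hypothetical marginals for a four-row table.
sensitive = ["Heart", "Flu", "Flu", "HIV"]
marginals = [0.4, 0.1, 0.1, 0.0]
p_attacker = attacker_distribution(sensitive, marginals)
print(p_attacker)
```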
We now define some measures of relative privacy risk, to be put to work in
Section 5.


Definition 1 (risk measures). Let p be a full support distribution on $\mathcal{S}$
and $(s,r)$ a row in t. We say this row is at risk under p if
$p(s) = \max_{s'} p(s')$, and that its risk level under p is $p(s)$. For an
individual row $(s,r)$ in t which is at risk under $p_A(\cdot \mid r,t^*)$, its
relative risk level is


$$R(s,r,t,t^*) \triangleq \frac{p_A(s \mid r,t^*)}{p_L(s \mid r,t^*)}.$$


For $\ell \in \{L,A\}$, let us define (using the multiset notation $\{|\cdots|\}$)
$N_\ell(t,t^*) \triangleq |\{|\, (s,r) \in t : (s,r) \text{ is at risk under } p_\ell(\cdot \mid r,t^*) \,|\}|$.
The global relative risk of t given $t^*$ is:

$$GR(t,t^*) \triangleq \max\left\{0,\ \frac{N_A(t,t^*) - N_L(t,t^*)}{N}\right\}.$$
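
A small numerical sketch may help to read these risk measures. It assumes that the relative risk level is the ratio between the attacker's and the learner's probability of the true sensitive value, and that the global relative risk is the excess number of rows at risk under the attacker, normalized by the table size N; both readings are assumptions of this sketch, as are all names and numbers.

```python
def at_risk(s, p):
    """A row with sensitive value s is at risk under p if p(s) is the maximum of p."""
    return p[s] == max(p.values())


def relative_risk_level(s, p_attacker, p_learner):
    """Assumed form: ratio between attacker and learner probabilities of s."""
    return p_attacker[s] / p_learner[s]


def global_relative_risk(rows, p_attacker_by_r, p_learner_by_r):
    """Assumed form: max(0, (N_A - N_L) / N), with N the number of rows."""
    n_a = sum(at_risk(s, p_attacker_by_r[r]) for s, r in rows)
    n_l = sum(at_risk(s, p_learner_by_r[r]) for s, r in rows)
    return max(0.0, (n_a - n_l) / len(rows))


# Hypothetical table rows (s, r) and conditional distributions p(. | r, t*).
rows = [("Heart", "r1"), ("Flu", "r1"), ("Flu", "r2"), ("HIV", "r2")]
p_learner = {"r1": {"Heart": 0.2, "Flu": 0.5, "HIV": 0.3},
             "r2": {"Heart": 0.5, "Flu": 0.3, "HIV": 0.2}}
p_attacker = {"r1": {"Heart": 0.6, "Flu": 0.3, "HIV": 0.1},
              "r2": {"Heart": 0.1, "Flu": 0.3, "HIV": 0.6}}

gr = global_relative_risk(rows, p_attacker, p_learner)
rr = relative_risk_level("Heart", p_attacker["r1"], p_learner["r1"])
print(gr, rr)
```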


4 Gibbs sampling 


For real world datasets, none of the distributions (4), (5) or (7) will be computable
analytically. Nonetheless, we can build accurate estimates of these distributions
from samples of the marginals of the density $f(\pi,t \mid t^*)$, with
$t = g_1,\dots,g_k$ (note that here the sensitive values $s_j$ are actually fixed and
known from $t^*$). This can be done using a Gibbs sampler, provided we can
effectively sample from the full conditionals of $\pi$ and $g_j$, for
$1 \le j \le k$. This is discussed below.

The Gibbs chain state sequence $(\pi^i,t^i)$, $i = 0,1,\dots$, is defined in the
usual way, starting from an initial state $x_0 = (\pi^0,t^0)$ and sampling in turn
$\pi^i$ and each of the groups of $t^i = g^i_1,\dots,g^i_k$ separately, from the
respective full conditionals. From equations (1), (2) and (3), it is easy to check
that:


$$f(\pi \mid t,t^*) = f(\pi \mid t) \qquad (8)$$

$$f(g_j \mid \pi,t_{-j},t^*) \propto f(g_j \mid \pi)\, f(g_j^* \mid g_j,t_{-j}) \qquad (1 \le j \le k). \qquad (9)$$

Each of the above two relations enables sampling from the corresponding full
conditional on the left-hand side. Indeed, (8) is a posterior Dirichlet
distribution, from which sampling can be performed easily and efficiently. Denote by
$\gamma(t) = (\gamma_1,\dots,\gamma_{|\mathcal{S}|})$ the vector of the frequency
counts $\gamma_i$ of each $s_i$ in t. Similarly, given s, denote by
$\delta^s(t) = (\delta^s_1,\dots,\delta^s_{|\mathcal{R}|})$ the vector of the
frequency counts $\delta^s_i$ of the pairs $(r_i,s)$, for each $r_i$, in t. Then, for
each $\pi = (\pi_S,\pi_{R|S})$, we have


$$f(\pi \mid t) = \mathrm{Dir}(\pi_S \mid \alpha + \gamma(t)) \cdot \prod_{s \in \mathcal{S}} \mathrm{Dir}(\pi_{R|s} \mid \beta^s + \delta^s(t)).$$
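
The Dirichlet update used when sampling the population parameters given a cleartext table can be sketched as follows. The sketch only counts frequencies, adds them to the hyperparameters and draws from the resulting Dirichlet distributions via normalized Gamma variates; the function names, the toy table and the flat prior are hypothetical, and this is not meant to reproduce the authors' implementation.

```python
import random
from collections import Counter


def posterior_params(rows, s_values, r_values, alpha, beta):
    """Return alpha + gamma(t) and, for each s, beta^s + delta^s(t)."""
    gamma = Counter(s for s, _ in rows)   # frequency of each sensitive value
    delta = Counter(rows)                 # frequency of each (s, r) pair
    post_alpha = [alpha[i] + gamma[s] for i, s in enumerate(s_values)]
    post_beta = {s: [beta[s][i] + delta[(s, r)] for i, r in enumerate(r_values)]
                 for s in s_values}
    return post_alpha, post_beta


def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical cleartext table of rows (s, r) and flat hyperparameters.
rows = [("Flu", "JP"), ("Flu", "JP"), ("Heart", "MY")]
s_values, r_values = ["Flu", "Heart"], ["JP", "MY"]
alpha = [1.0, 1.0]
beta = {"Flu": [1.0, 1.0], "Heart": [1.0, 1.0]}

post_alpha, post_beta = posterior_params(rows, s_values, r_values, alpha, beta)
rng = random.Random(0)
pi_s = sample_dirichlet(post_alpha, rng)
pi_r_given_s = {s: sample_dirichlet(post_beta[s], rng) for s in s_values}
print(post_alpha, post_beta)
```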
Let us now discuss (9). Here we will confine ourselves to the important case when
the following conditions are satisfied: (a) the obfuscation function is deterministic,
so that $f(g_j^* \mid t)$ equals 0 or 1; (b) the set $\mathcal{G}_j$ of the $g_j$'s
such that $f(g_j^* \mid g_j,t_{-j}) = 1$ depends solely on $g_j^* = (l_j,m_j)$, and
is given by


$$\mathcal{G}_j = \begin{cases} \{\, g = (s_{j,1},r_1),\dots,(s_{j,n_j},r_{n_j}) : r_\ell \in l_j \text{ for } 1 \le \ell \le n_j \,\} & \text{(horizontal schemes)} \\ \{\, g = (s_{j,1},r_{i_1}),\dots,(s_{j,n_j},r_{i_{n_j}}) : r_{i_1},\dots,r_{i_{n_j}} \text{ a permutation of } l_j \,\} & \text{(vertical schemes).} \end{cases} \qquad (10)$$


This assumption is exact in many important cases (e.g. Anatomy) and reasonable
in the remaining ones. Under assumptions (a) and (b), with $\mathcal{G}_j$ as in
(10), sampling from (9) amounts to drawing an element $g_j \in \mathcal{G}_j$ with
probability $\propto f(g_j \mid \pi)$. This can be achieved via different techniques
in each of the two cases of interest, horizontal and vertical; the details are
omitted here due to lack of space.
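
For very small groups, the vertical case can be illustrated by brute force: enumerate the distinct arrangements of the nonsensitive values and draw one with probability proportional to the product of the conditional probabilities of each nonsensitive value given the (fixed) sensitive value of its row. This is only a naive sketch for illustration, feasible for tiny groups; it is not the technique whose details are omitted above, and all names and numbers are hypothetical.

```python
import itertools
import random


def sample_vertical_group(sensitive, nonsensitive, pi_r_given_s, rng):
    """Draw an assignment of nonsensitive values to rows, with weight prod_i pi(r_i | s_i)."""
    # Distinct arrangements of the multiset of nonsensitive values.
    candidates = sorted(set(itertools.permutations(nonsensitive)))
    weights = []
    for perm in candidates:
        w = 1.0
        for s, r in zip(sensitive, perm):
            w *= pi_r_given_s[s][r]
        weights.append(w)
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return list(zip(sensitive, chosen))


# Hypothetical group: sensitive values are fixed, nonsensitive values are permuted.
sensitive = ["Heart", "Flu"]
nonsensitive = ["JP", "MY"]
pi_r_given_s = {"Heart": {"JP": 0.0, "MY": 1.0},
                "Flu": {"JP": 1.0, "MY": 0.0}}

group = sample_vertical_group(sensitive, nonsensitive, pi_r_given_s, random.Random(1))
print(group)
```

The factor depending only on the sensitive values is the same for every arrangement, so it can be dropped from the weights.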


5 Experiments 


We have applied a proof-of-concept implementation of our methodology to
a subset of the Adult dataset from the UCI machine learning repository [6]. The
considered subset consists of 5692 rows, with the following categorical attributes:
sex, race, marital status, education, native country, workclass, salary class,
occupation, with occupation considered as the only sensitive attribute. Using the ARX
anonymization tool [3], we have obtained three different anonymized versions of
the considered dataset, enjoying k-anonymity for, respectively, k = 4, k = 5 and
k = 10. The average size of the groups varied from 38 rows (for k = 4) to 355 rows
(for k = 10). We ran the Gibbs sampler on each of these three anonymized datasets.
We obtained the following figures for the global relative risks (cf. Def. 1) of the
three datasets: $GR_1 = 3.98\%$, $GR_2 = 1.7\%$ and $GR_3 = 1.86\%$. In absolute
terms, the fraction of rows of $t^*$ correctly classified by the attacker ranged from
27.3% to 29.4%. The maximal relative risk level R ranged from about 1.9 to 3.93.

All in all, these results indicate that, in each case, the considered anonymized
datasets imply a significant relative privacy risk for an appreciable fraction of the
rows.
A stochastic volatility framework with analytical
filtering 


Giacomo Bormetti, Roberto Casarin, Fulvio Corsi and Giulia Livieri 


Abstract Motivated by the fact that realized measures of volatility are affected by 
measurement errors, we introduce a new family of discrete-time stochastic volatility 
models having two measurement equations relating both the observed returns and 
realized measures to the latent conditional variance. 


Key words: Bayesian Inference, Monte Carlo Markov Chain, High-frequency, Re- 
alized volatility, ARG, Stochastic volatility 


1 Introduction 


In this paper we introduce a new family of discrete-time Stochastic Volatility (SV)
models for the joint modelling of returns and realized measures of volatility. The
proposed model is characterized by two measurement equations for the latent
volatility: (i) a Normal density for the daily returns and (ii) a Gamma density for
the realized variance (RV) measure. We term the general version of the proposed
model SV-ARG. A salient feature of the SV-ARG is that it allows for analytical
filtering and smoothing recursions for the latent factor that drives the dynamics of
daily returns. This permits us to develop an effective Bayesian inference procedure
for both the parameters and the latent factor.
2 The model


Consider a financial log-return process $r_t$, a realized variance process $y_t$ and
a latent volatility process $h_t$. Let $\mathcal{F}_t = \sigma(r_t,y_t)$ be the
σ-algebra containing the information about observable quantities (log-return $r_t$
and realized variance $y_t$) available at time t, and
$\mathcal{F}^h_t = \sigma(\mathcal{F}_{t-1},h_t)$. We assume the following model for
the dynamics of the log-returns:


$$r_t = \mu + \gamma h_t + \sqrt{h_t}\,\varepsilon_t, \qquad \varepsilon_t \overset{i.i.d.}{\sim} \mathcal{N}(0,1), \qquad (1)$$


$t = 1,\dots,T$, where $\mu$ is the risk-free rate and $\gamma$ is the market price
of risk. $\mathcal{N}(m,\sigma^2)$ indicates the univariate normal distribution with
mean m and variance $\sigma^2$. The dynamics in Equation (1) differ from those
employed in Corsi et al. (2013) and Majewski et al. (2015) for daily log-returns in
that those works use a realized measure of volatility as the driving process for
returns. Specifically, they employ the continuous part of the realized variance,
hereafter RV, defined as the sum of squared returns over non-overlapping intervals
within a sampling period. We refer to Equation (1) as the return equation.

Since the RV contains information on the latent volatility process, we follow
Hansen and Lunde (2006), Engle and Gallo (2006), Shephard and Sheppard
(2010) and Takahashi et al. (2009) and introduce another measurement equation, termed
the realized variance equation, which relates the observed RV to the latent process
$h_t$. Specifically, we assume that the realized variance $y_t$ is sampled from a
Gamma distribution


$$y_t \mid \mathcal{F}^h_t \overset{i.d.}{\sim} \mathcal{G}(\alpha, h_t), \qquad (2)$$


where $\alpha \in \mathbb{R}_+$ is constant. In the previous equation,
$\mathcal{G}(k,\theta)$ denotes a Gamma distribution with positive shape parameter k
and scale parameter $\theta$.

We assume that $h_t$ follows an autoregressive gamma process with transition
distribution (see Gouriéroux and Jasiak, 2006):


$$h_t \mid \mathcal{F}^h_{t-1} \overset{i.d.}{\sim} \bar{\mathcal{G}}(\nu, \beta h_{t-1}, c). \qquad (3)$$


In the previous equation, $\bar{\mathcal{G}}(\nu,\beta h_{t-1},c)$ denotes the
non-central gamma distribution with shape $\nu > 0$, scale $c > 0$ and non-centrality
$\beta h_{t-1}$. Using the Poisson mixture representation for the non-central gamma
distribution (see Gouriéroux and Jasiak, 2006, for more details), we rewrite
Equation (3) as


$$h_t \mid z_t \overset{i.d.}{\sim} \mathcal{G}(\nu + z_t, c), \qquad z_t \mid h_{t-1} \overset{i.d.}{\sim} \mathcal{P}(\beta h_{t-1}),$$

where, in general, $\mathcal{P}(\upsilon)$ indicates the Poisson distribution with
intensity parameter $\upsilon$. The latter representation is useful for both the
characterization of $h_t$ and the inference procedure. The stationarity conditions,
the conditional moment generating function of this process and its risk neutral
dynamics are given in Bormetti et al. (2016).
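
The Poisson mixture representation also gives a direct way to simulate the model. The sketch below is a minimal illustration under explicit assumptions: the Poisson intensity is taken to be the persistence coefficient times the lagged volatility (as in the Gouriéroux and Jasiak parametrization), the return is Normal with mean mu + gamma * h_t and variance h_t, and the realized variance is Gamma with shape alpha and scale h_t. Function names, the simple Poisson sampler and all parameter values are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small intensities used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_sv_arg(n_obs, mu, gamma, alpha, nu, c, beta, h0, seed=0):
    """Simulate (r_t, y_t, h_t, z_t) for t = 1..n_obs from the assumed SV-ARG dynamics."""
    rng = random.Random(seed)
    h_prev, path = h0, []
    for _ in range(n_obs):
        z = sample_poisson(beta * h_prev, rng)       # mixing variable z_t | h_{t-1}
        h = rng.gammavariate(nu + z, c)              # h_t | z_t: shape nu + z_t, scale c
        r = mu + gamma * h + math.sqrt(h) * rng.gauss(0.0, 1.0)  # return equation
        y = rng.gammavariate(alpha, h)               # realized variance: shape alpha, scale h_t
        path.append((r, y, h, z))
        h_prev = h
    return path


# Hypothetical parameter values, chosen only for illustration.
path = simulate_sv_arg(n_obs=50, mu=0.0, gamma=1.0, alpha=2.0,
                       nu=2.0, c=0.5, beta=0.6, h0=1.0, seed=42)
print(path[0])
```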


3 Analytical filtering and smoothing 


Applying arguments similar to those in Creal (2015), we are able to provide
analytical expressions for: (i) the conditional likelihood, (ii) the Markov
transition, (iii) the initial distribution of $z_t$, and (iv) the filtering and
smoothing distributions of the latent process $h_t$. In particular,
the following two propositions hold.

Proposition 1. For the SV-ARG model described in Equations (1), (2) and (3), the
conditional likelihood, $p(r_t,y_t \mid z_t;\theta)$, the Markov transition,
$p(z_t \mid z_{t-1},r_{t-1},y_{t-1};\theta)$, and the initial distribution of $z_t$,
$p(z_1;\theta)$, are respectively given by:


Au) 
(*) 
rn) =a (029) ( «| | 


; (a) c 
Plek t-i J130) & SF (reneo) 
c $ 


P(z1;0)< VB (v,69) ; 


with 
exp (yu) vi! 1 
iO) = 3 
N (2915 9) Vin TaI Via E 
Hr =r H, 
O = a, 


$\lambda(z_t) = \nu + z_t - \alpha - 1/2,$


$\chi_t = \tilde{r}_t^2 + 2 y_t,$
Lot = Vi, 


$\psi = \gamma^2 + \frac{2}{c}.$
Proof. See Bormetti et al. (2016). 
Proposition 2. Let $\lambda(z_t)$, $\chi_t$ and $\psi$ be the quantities defined in Proposition 1. The


marginal filtered, p(h;|¥1:1,¥1:2,Z1-21,X1:1; 9), and smoothed, p(h;|¥1:7,Y1:7 ,Z1:7,X1:73 9) 
distributions are 


(hye yin Br X13 8) Lig (A(z),2.W), 


(d) 
p(hi|ti:T,y1:T,Z1:T,X1:7;0) « Gig (26042000 w+2! ) i; 


e 
t=1,---,T. 
Proof. See Bormetti et al. (2016). 
4 Simulation results


For the SV-ARG model we simulate 50 data series of 1,000 observations each. For each
data series we run the Gibbs sampler in Bormetti et al. (2016) for 100,000
iterations, discard the first 20,000 draws to avoid dependence on initial conditions,
and finally apply a thinning procedure to reduce the dependence between consecutive
draws. We test the efficiency of the algorithm in three different scenarios:
LOW PERSISTENCE ($\beta = 0.3$), MEDIUM PERSISTENCE ($\beta = 0.6$) and HIGH
PERSISTENCE ($\beta = 0.9$). The true values of the other parameters used in the
simulations are reported in Table 1, together with the grand average of the parameter
posterior means and their robust standard deviations. The results in Table 1
indicate that the accuracy of the MCMC scheme is remarkable in all scenarios. As
regards efficiency, the inefficiency factors after thinning are below ten.
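
The post-processing steps mentioned above (burn-in removal, thinning and an inefficiency factor) can be sketched as follows. The inefficiency factor is computed here with a simple truncated estimator, one plus twice the sum of the sample autocorrelations up to a chosen lag; the truncation lag, the thinning step and the function names are assumptions of this sketch, since the text does not report them.

```python
def burn_and_thin(draws, burn_in, step):
    """Drop the first burn_in draws, then keep one draw every `step`."""
    return draws[burn_in::step]


def inefficiency_factor(draws, max_lag):
    """Truncated estimator: 1 + 2 * sum of sample autocorrelations up to max_lag."""
    n = len(draws)
    mean = sum(draws) / n
    var = sum((x - mean) ** 2 for x in draws) / n
    total = 1.0
    for lag in range(1, max_lag + 1):
        cov = sum((draws[i] - mean) * (draws[i + lag] - mean)
                  for i in range(n - lag)) / n
        total += 2.0 * cov / var
    return total


# Hypothetical chain: index values stand in for MCMC draws of one parameter.
chain = list(range(10))
kept = burn_and_thin(chain, burn_in=4, step=2)

# A perfectly alternating chain has strongly negative lag-1 autocorrelation.
alternating = [1.0, -1.0] * 50
print(kept, inefficiency_factor(alternating, max_lag=1))
```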


Table 1 Summary output of the parameter estimates for the SV-ARG model

             LOW PERSISTENCE     MEDIUM PERSISTENCE   HIGH PERSISTENCE
θ    TRUE    ESTIMATE  STD       ESTIMATE  STD        ESTIMATE  STD
     0.0      0.0018   0.0118    -0.0051   0.0177     -0.0074   0.0358
     1.0      1.0552   0.0738     1.0523   0.0720      1.0685   0.0784
              0.8428   0.0572     0.8327   0.0575      0.8474   0.0647
     0.8      0.8033   0.0371     0.7981   0.0394      0.8182   0.0576
     1.0      0.9654   0.0938     0.9706   0.0909      0.9395   0.0790
              0.3118   0.0595     0.6376   0.0746      0.9702   0.0839
5 Conclusions


Motivated by the presence of measurement errors in empirically computed realized
volatility measures, we introduce a new family of discrete-time models. We
derive analytical filtering and smoothing recursions and show that they can be used
for efficient inference on the parameters and the latent volatility process.


Acknowledgements All authors warmly thank Drew D. Creal for helpful comments on the imple- 
mentation of the algorithm for computing the Bessel function of the second kind and Dario Alitab 
for support during the development of the pricing code. The research activity of RC is supported 
by funding from the European Union, Seventh Framework Programme FP7/2007-2013 under 
Grant agreement FP7/2007-2013, and by the Italian Ministry of Education, University and 
Research (MIUR) PRIN 2010-11 Grant MISURA. GL acknowledges research support from the 
Scuola Normale Superiore Grant SNS_14_BORMETTI and CI14_UNICREDIT_MARMI. This 
research used the SCSCF multiprocessor cluster system at University Ca’ Foscari of Venice. 




References 


Bormetti, G., Casarin, R., Corsi, F., and Livieri, G. (2016). Smiles at errors: A 
discrete-time stochastic volatility framework for pricing options with realized 
measures. Working Paper, University Ca’ Foscari of Venice. 

Chib, S., Nardari, F., and Shephard, N. (2002). Markov chain Monte Carlo methods
for stochastic volatility models. Journal of Econometrics, 108(2):281-316.

Corsi, F., Fusari, N., and La Vecchia, D. (2013). Realizing smiles: Options pricing 
with realized volatility. Journal of Financial Economics, 107(2):284-304. 

Creal, D. D. (2015). A class of non-Gaussian state space models with exact likeli- 
hood inference. Journal of Business & Economic Statistics, (just-accepted). 

Engle, R. F. and Gallo, G. M. (2006). A multiple indicators model for volatility 
using intra-daily data. Journal of Econometrics, 131(1):3-27. 

Gouriéroux, C. and Jasiak, J. (2006). Autoregressive gamma processes. Journal of 
Forecasting, 25(2):129-152. 

Hansen, P. R. and Lunde, A. (2006). Realized variance and market microstructure 
noise. Journal of Business & Economic Statistics, 24(2):127-161. 

Majewski, A. A., Bormetti, G., and Corsi, F. (2015). Smile from the past: A gen- 
eral option pricing framework with multiple volatility and leverage components. 
Journal of Econometrics, 187(2):521-531. 

Shephard, N. and Sheppard, K. (2010). Realising the future: Forecasting with high- 
frequency-based volatility (HEAVY) models. Journal of Applied Econometrics, 
25(2):197-231. 

Takahashi, M., Omori, Y., and Watanabe, T. (2009). Estimating stochastic volatility 
models using daily returns and realized volatility simultaneously. Computational 
Statistics & Data Analysis, 53(6):2404-2426. 


Estimating Italian inflation using scanner data: 
results and perspectives 


L’uso degli scanner data per la stima dell’inflazione: 
risultati e prospettive 


Alessandro Brunetti, Stefania Fatello, Federico Polidoro 


Abstract Scanner data from the retail outlets of modern distribution represent a
crucial challenge for inflation indicators. Istat is actively participating in the
European project aimed at obtaining and processing scanner data to compile the
HICP. Since the end of 2013, a stable cooperation has been in place among Istat, the
Association of Modern Distribution, retail trade chains and Nielsen in order to
provide Istat with scanner data. For 2014, 2015 and 2016, scanner data on grocery
products were collected by Istat through Nielsen for about 1400 outlets of the six
main retail trade chains in 37 provinces. For 2016 and 2017, scanner data for about
2100 outlets covering the entire national territory will be available. In view of the
large-scale adoption of scanner data to estimate inflation, scheduled for January
2018, experimental HICPs/CPIs for one ECOICOP group (non-alcoholic beverages)
in two provinces are compiled using scanner data. A comparison with the indices
currently released is carried out, providing a preliminary evaluation of the impact
of the new data source on inflation estimation. Issues concerning index formulas
are dealt with, and those regarding missing observations, imputations and
replacements are explored.

Abstract (from the Italian) Scanner data from the outlets of large-scale retail
distribution are a crucial opportunity for the estimation of inflation. Istat takes
part in the European project aimed at acquiring and processing this information for
the calculation of consumer price indices and, since the end of 2013, has worked
closely with the Associazione Distribuzione Moderna, the large-scale retail chains
and Nielsen. For the years 2014, 2015 and 2016, Istat acquired the prices of grocery
products from about 1400 outlets of the six main large-scale retail chains in 37
provinces. For 2016 and 2017, scanner data are expected for more than 2100 stores
covering the entire national territory. In view of the large-scale adoption of
scanner data for the estimation of inflation, planned from 2018, the paper compares
official indices with experimental indicators (based on scanner data) for one
ECOICOP group (non-alcoholic beverages) in two provinces, providing a first
assessment of the impact of the new sources on the estimation of inflation. The
analysis then focuses on the formulas used to calculate the indices and explores the
problems related to the treatment of non-response, imputations and product
replacements.


Key words: Scanner data, inflation, modern distribution, dynamic approach 


1 Introduction (extended abstract) 


Scanner data from the retail outlets of modern distribution represent a crucial
challenge for inflation indicators. Istat is actively participating in the European
project aimed at obtaining and processing scanner data to compile the HICP. Since
the end of 2013, a stable cooperation has been in place among Istat, the Association
of Modern Distribution, retail trade chains and Nielsen in order to provide Istat
with scanner data. For 2014, 2015 and 2016, scanner data on grocery products were
collected by Istat through Nielsen for about 1400 outlets of the six main retail
trade chains in 37 provinces. For 2016 and 2017, scanner data for about 2100 outlets
covering the entire national territory will be available.

In view of the large-scale adoption of scanner data to estimate inflation,
scheduled for January 2018, experimental HICPs/CPIs for one ECOICOP group
(non-alcoholic beverages) in two provinces (Rome and Turin) are compiled using
scanner data. The experimental indices are calculated from the unit values of, on
average, more than 75,000 product-offers¹ available in about 300 outlets included in
the samples selected for the two provinces for the years 2015-2016 (Section 2 gives a
detailed description of the dataset). The aggregation of elementary data is
addressed in Section 3, where particular attention is devoted to the formula used to
calculate micro-indices. As discussed in the paper, the choice of the aggregation
method at the lowest level of index calculation depends strictly on the approach
used to define the sample of product-offers (references) within each outlet.
Specifically, a dynamic approach to sampling is adopted in what follows: under this
methodology, the set of product-offers selected in each outlet varies from month to
month. In this framework, monthly indices at the very first step of aggregation are
calculated by chaining monthly relatives based on the same sample over two adjacent
months.

¹ A “product-offer” or “reference” is a specific item, identified by a GTIN code,
sold in a specific outlet, for which information on turnover and quantities is
available.

Section 4 discusses the results of the analysis. A preliminary estimate of the
impact of the new data source on inflation is obtained by comparing the experimental
indices of the two provincial capitals with the corresponding indicators currently
released.

The conclusions focus on the open issues (above all, how to deal with temporarily
unavailable product-offers, replacements and seasonal goods in the monthly selection
of references), outlining the solution that Istat is going to adopt and the future
development of the project.


2 Description of the dataset 


Scanner data provided by the six main retail trade chains represent, at national
level, about 57% of the turnover of modern distribution. Istat receives weekly data
for each outlet, distinguished by outlet-type (hypermarket and supermarket), for
food and grocery products (excluding fresh products sold by variable weight).

The analysis was carried out on all outlets in the provinces of Rome and Turin
delivered by Nielsen for the two years 2015 and 2016. Table 1 reports the number of
outlets by retail chain, outlet-type and province in each year considered.


Table 1: Number of outlets by retail chain, outlet-type and province (2015-2016)

                                     2015                                    2016
                           Rome               Turin                Rome               Turin
Retail chain         Hyper Super Total   Hyper Super Total   Hyper Super Total   Hyper Super Total
Conad                    4    38    42       2    10    12       4    33    37       2    10    12
Coop Italia              7     8    15       6    19    25       7     8    15       6    19    25
Esselunga                -     -     -       3     -     3       -     -     -       3     -     3
Auchan                   4    27    31       3     6     9       4    26    30       3     5     8
Carrefour Italia         4    71    75      13    25    38       4    71    75      13    26    39
Selex                    -    26    26       1    34    35       -    26    26       1     7     8
Total                   19   170   189      28    94   122      19   164   183      28    67    95


Scanner data from retail chains contain weekly data on turnover and quantities
sold per item code, or GTIN. GTIN (Global Trade Item Number) is the current name of
the barcode formerly known as EAN, and it is the most commonly used code when
dealing with scanner data. A GTIN identifies a unique product and consistently
refers to the same product over time. Unit value prices per item code can therefore
be derived as the average of the prices actually paid by households. For each GTIN,
the weekly price (weekly unit value) is calculated by dividing weekly turnover by
weekly quantities, and the monthly price (monthly unit value) by dividing monthly
turnover by monthly quantities. CPIs/HICPs can be calculated using the first three
full weeks of the month, but in the following analysis the indices are also
calculated when only two weeks or one week are available. The underlying hypothesis
is that the missing weeks can be estimated from the weeks for which information is
available.
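As a minimal sketch of the unit value calculation described above, the following Python fragment sums turnover and quantities over the weeks available for each item, outlet and month, and divides the two totals. The record layout, the function name `unit_values` and the sample figures are illustrative assumptions, not Istat's actual data format.

```python
from collections import defaultdict

def unit_values(records):
    """Aggregate weekly scanner records into monthly unit values.

    Each record is (gtin, outlet, month, week, turnover, quantity).
    Returns {(gtin, outlet, month): unit value}.
    """
    turnover = defaultdict(float)
    quantity = defaultdict(float)
    for gtin, outlet, month, _week, value, qty in records:
        key = (gtin, outlet, month)
        turnover[key] += value
        quantity[key] += qty
    # Unit value = total turnover / total quantity over the weeks available
    return {k: turnover[k] / quantity[k] for k in turnover if quantity[k] > 0}

# Hypothetical item sold in one outlet during three weeks of January 2016
records = [
    ("8001", "A", "2016-01", 1, 30.0, 20),
    ("8001", "A", "2016-01", 2, 22.0, 20),
    ("8001", "A", "2016-01", 3, 12.0, 10),
]
monthly = unit_values(records)
```

Because the monthly unit value is a ratio of totals, weeks with larger quantities weigh more than weeks with smaller quantities, which is what makes it an average of the prices actually paid.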

To map each GTIN to the lowest level of the ECOICOP classification, it was
necessary to pass through the lowest level of the ECR classification (ECR-market),
the classification of products shared by industrial and distribution companies, to
which each GTIN code is assigned. Istat mapped the ECR-markets (about 1600 entries)
to the Italian ECOICOP-6 level (consumption segments), so that GTINs are
automatically classified within the ECOICOP-6 level.

Each consumption segment contains a variable number of ECR-markets with very
different turnover shares. ECOICOP group 01.2 “Non-alcoholic beverages” was chosen
for the analysis. For each ECOICOP-6 level belonging to this group, Table 2 reports
the corresponding number of ECR-markets and the turnover shares over all outlets of
Rome and Turin in 2015 and 2016. An ECR-market within a specific outlet is defined
as the elementary aggregate (EA), which is the lowest level at which elementary
indices are calculated using scanner data. Table 3 shows the average number of GTINs
within each COICOP-6 level used to calculate elementary indices in both provinces
for each year.


Table 2: Number of ECR-markets within COICOP-6 level with turnover shares (%) by province (2015-2016)

                                                               2015            2016
Coicop-6 level  Description                    N° markets    Rome   Turin    Rome   Turin
01.2.1.1.0      Coffee                                 13    24.0    27.0    24.2    27.5
01.2.1.2.0      Tea                                    12     5.7     6.2     5.9     6.3
01.2.1.3.0      Cocoa and powdered chocolate            3     1.5     1.4     1.4     1.4
01.2.2.1.0      Mineral or spring waters               15    31.8    28.0    32.4    28.9
01.2.2.2.1      Carbonated soft drinks                 13    18.5    17.4    18.3    17.0
01.2.2.2.2      Other soft drinks                       8     5.0     7.3     4.8     6.8
01.2.2.3.0      Fruit and vegetable juices             43    13.6    12.8    13.0    12.0
Total                                                 107   100.0   100.0   100.0   100.0


Table 3: Number of GTINs within COICOP-6 level by province (2015-2016)

                                                     2015              2016
Coicop-6 level  Description                       Rome    Turin     Rome    Turin
01.2.1.1.0      Coffee                           8,722    5,360    9,190    5,010
01.2.1.2.0      Tea                              7,077    5,039    7,239    4,467
01.2.1.3.0      Cocoa and powdered chocolate       953      666      984      609
01.2.2.1.0      Mineral or spring waters         6,132    3,253    6,196    3,048
01.2.2.2.1      Carbonated soft drinks           7,700    4,920    7,987    4,442
01.2.2.2.2      Other soft drinks                5,182    3,148    5,066    2,694
01.2.2.3.0      Fruit and vegetable juices      12,606    7,875   12,847    7,120
Total                                           48,372   30,259   49,508   27,390


3 Index calculation formulas


In compliance with the dynamic approach, a sample of the product-offers that are
present in both the current and the preceding month is selected in each outlet every
month. In particular, the dynamic basket of references is obtained by applying a set
of filters that select a matched sample each month by comparing the current month
with the preceding month. To this end, three filters are considered:

• A dump filter, which removes products showing strong decreases in both price and
  turnover/quantities;

• An outlier filter, which removes prices whose decrease or increase exceeds certain
  thresholds;

• A low-sales filter, which filters out item codes with very low sales (the
  low-sales filter is determined empirically so that the selected item codes
  represent about 80% of turnover at the ECR-market level).
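The three filters can be sketched as follows. This is only an illustration of the idea: the threshold values (`dump_drop`, `outlier_ratio`), the order in which the filters are applied and the function name `matched_sample` are assumptions made for the example, since the paper does not report the parameters actually used.

```python
def matched_sample(prev, curr, dump_drop=0.5, outlier_ratio=4.0, coverage=0.8):
    """Select the matched sample for one ECR-market in one outlet.

    prev, curr: {gtin: (unit_value, turnover)} for the preceding and the
    current month. All threshold values are illustrative assumptions.
    """
    matched = [g for g in curr if g in prev]
    kept = []
    for g in matched:
        p0, t0 = prev[g]
        p1, t1 = curr[g]
        ratio = p1 / p0
        # Dump filter: price and turnover both fall sharply (clearance sales)
        if ratio < 1 - dump_drop and t1 < (1 - dump_drop) * t0:
            continue
        # Outlier filter: price change outside the allowed band
        if ratio > outlier_ratio or ratio < 1 / outlier_ratio:
            continue
        kept.append(g)
    # Low-sales filter: keep best sellers until about 80% of turnover is covered
    kept.sort(key=lambda g: prev[g][1] + curr[g][1], reverse=True)
    total = sum(prev[g][1] + curr[g][1] for g in kept)
    sample, covered = [], 0.0
    for g in kept:
        if total > 0 and covered >= coverage * total:
            break
        sample.append(g)
        covered += prev[g][1] + curr[g][1]
    return sample

# Hypothetical data: "d" is dumped, "e" and "f" are unmatched, "c" has low sales
prev = {"a": (1.0, 100), "b": (2.0, 50), "c": (1.0, 5), "d": (1.0, 40), "e": (3.0, 10)}
curr = {"a": (1.1, 110), "b": (2.0, 50), "c": (1.0, 5), "d": (0.3, 5), "f": (1.0, 20)}
sample = matched_sample(prev, curr)
```

Items that appear in only one of the two months never enter the sample, which is the defining property of the matched (dynamic) basket.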

The EA index is calculated on the matched set of representative item codes,
classified in a given ECR-market, that are actually sold in two subsequent periods
in a given outlet. Specifically, in each ECR-market and in each outlet, an
unweighted Jevons index is calculated over the current and the preceding month as
follows:

$$P_{Jev}^{m-1,t;\,m,t} = \prod_{n \in S^{m-1,m}} \left( \frac{p_n^{m,t}}{p_n^{m-1,t}} \right)^{\frac{1}{c(S^{m-1,m})}}$$

where:
$p_n^{m,t} / p_n^{m-1,t}$ is the price relative between month m-1 and m,
$S^{m-1,m}$ is the set of representative items sold in month m-1 and m,
$c(S^{m-1,m})$ is the number of representative item codes in $S^{m-1,m}$.

The EA chain-linked index is then as follows:

$$P_{Jev}^{0,t;\,m,t} = \prod_{n \in S^{0,1}} \left( \frac{p_n^{1,t}}{p_n^{0,t}} \right)^{\frac{1}{c(S^{0,1})}} \times \prod_{n \in S^{1,2}} \left( \frac{p_n^{2,t}}{p_n^{1,t}} \right)^{\frac{1}{c(S^{1,2})}} \times \cdots \times \prod_{n \in S^{m-1,m}} \left( \frac{p_n^{m,t}}{p_n^{m-1,t}} \right)^{\frac{1}{c(S^{m-1,m})}}$$

It should be noted that relaunches (new versions of the same item with some
superficial differences and a new item code) and replacements (discounted items that
receive a new item code; products of a certain brand replaced by similar products of
another brand) are a potential problem for this sampling method, as the system does
not automatically link a disappearing product-offer with its relaunch or
replacement. For this reason, algorithms have to be implemented to detect and treat
relaunches and replacements appropriately (by combining the old and new
product-offers and then calculating unit value indices for the combination).
Moreover, prices for item codes that are not present in subsequent periods should be
imputed, to ensure that seasonal items re-enter the index at the correct time.

For the calculation of the experimental indices discussed in this paper, however,
neither the treatment of relaunches/replacements nor the imputation of absent
references has been taken into account.

The ECOICOP experimental indices (3- and 4-digit levels) are calculated as
Laspeyres indices, through successive aggregations of EA indices:

1) First, the ECR-market indices for each outlet type (hypermarket and
   supermarket) at the retailer level are obtained as the weighted arithmetic mean
   of the EA chain-linked indices, with weights proportional to the turnover
   shares of the outlets concerned.

2) The ECR-market indices of the two outlet types (hypermarket and
   supermarket) are then compiled by aggregating the corresponding indices of
   the different retailers (weights proportional to turnover shares).

3) The provincial ECR-market indices are calculated by aggregating the
   ECR-market indices of hypermarkets and supermarkets (weights proportional to
   turnover shares).

4) Finally, the ECOICOP indices result from the aggregation of the ECR-market
   indices (weights proportional to turnover shares at the 7-digit level, and
   HICP weights for the 6- to 4-digit indices).
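Each aggregation step above is a weighted arithmetic mean of lower-level indices. The sketch below shows the first two steps with hypothetical outlet names, index values and turnover figures. The same helper would apply to the later steps with the appropriate weights.

```python
def weighted_mean(indices, weights):
    """Weighted arithmetic mean of indices, with weights given as turnover."""
    total = sum(weights[k] for k in indices)
    return sum(indices[k] * weights[k] for k in indices) / total

# Step 1: EA chain-linked indices of the outlets of one retailer and outlet type
ea_indices = {"outlet_1": 101.0, "outlet_2": 103.0}
outlet_turnover = {"outlet_1": 300.0, "outlet_2": 100.0}
retailer_index = weighted_mean(ea_indices, outlet_turnover)

# Step 2: retailers within the same outlet type (figures are illustrative)
retailer_indices = {"chain_A": retailer_index, "chain_B": 100.0}
retailer_turnover = {"chain_A": 400.0, "chain_B": 400.0}
outlet_type_index = weighted_mean(retailer_indices, retailer_turnover)
```

Passing raw turnover rather than pre-computed shares gives the same result, because the helper normalises the weights by their sum.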


4 Results 


Figures 1 and 2 show the first results of the analysis for the group considered
and for one class (mineral waters, soft drinks, fruit and vegetable juices). The
annual rates of change calculated from the experimental scanner data indices are
compared with the corresponding indicators calculated from the territorial data
collection (provincial HICPs for Rome and Turin are compiled for this purpose).

In both provinces the figures show a similar trend for the two annual rates of
change. Two main differences emerge from a preliminary analysis. The first is the
more regular trend of the indices compiled with scanner data. The second is the
higher inflation registered by the experimental indices. The average annual rate of
change in 2016 of the prices of non-alcoholic beverages in Rome is +0.4% with
scanner data and 0.0% with territorial data (in Turin, +0.4% versus -0.3%), and
similar results emerge for mineral waters, soft drinks, fruit and vegetable juices
(+0.6% versus -0.3% in Rome and +0.6% versus -0.2% in Turin).

Although this empirical evidence is limited, it shows that using a large amount
of data (such as scanner data) with better coverage of time (not just one collection
a month, as in the territorial data collection) and of space (a much larger number
of outlets than in the current sample, and all the GTINs belonging to a consumption
segment rather than only the best-selling product-offer) should make it possible to
eliminate the volatility arising from the limits of the current territorial data
collection.

Moreover, the results show that the sign of the impact of adopting scanner data on
inflation estimation could, at least locally, be different (upward) from what is
expected (downward).

Clearly, the results obtained cannot be generalized. The reason lies not only in
the limited scope of the test (two ECOICOP aggregates, two towns) but also in the
issues concerning the use of scanner data whose solutions have not yet been
implemented: replacements, treatment of relaunches and seasonal goods, imputations,
and the combination of indices calculated with scanner data from modern
distribution with indices compiled for traditional distribution.

It is worth noting that, as far as replacements and relaunches are concerned, the
filter used in the dynamic approach adopted in this test ensures good
representativeness of the sample in terms of turnover.


Figure 1: Harmonized index of consumer prices of non-alcoholic beverages. M/M-12 rates of change -
Year 2016
(Panels: Rome and Turin. Series: scanner data (SD) and territorial data (Terr.).)


Figure 2: Harmonized index of consumer prices of mineral waters, soft drinks, fruit and vegetable
juices. M/M-12 rates of change - Year 2016


5 Concluding remarks and open issues


The potential improvements arising from the use of scanner data in inflation
estimation make this source a key element of the project to modernize price
statistics pursued at Italian and European level. Qualitative improvements also
emerge from the analysis carried out in this paper, and they depend on the
characteristics of scanner data: wide temporal and spatial coverage, prices
referring to actual transactions, information on quantities sold and turnover, and
the possibility of reducing the administrative burden which now weighs, for example,
on the Municipal Offices of Statistics that collect elementary price quotes in the
field.

Together with this potential, the issues also have to be stressed so that they can
be dealt with. Most of them, apart from the dependency on retailers for data
provision, are methodological. As noted above, they concern relaunches, disappeared
items and their replacements, the treatment of seasonal goods and the imputation of
missing observations.

Since Istat is moving towards a dynamic approach to the sampling of references, and
given the large amount of data to be processed each month, automatic solutions have
to be adopted to handle all the references that cannot be matched in the monthly
sample (the references of the previous month that have disappeared and the new
references of the current month). These solutions must evaluate the nature of such
references (seasonal or not), whether they are relaunches, whether prices should be
imputed for the disappeared references, and whether and how to replace them.

Dealing with these issues is the main and crucial task that Istat has taken on for
the coming weeks and months, towards the adoption of scanner data to calculate
official HICPs/CPIs in 2018.

Intermediate results of this challenging work will be discussed in a workshop at
Istat scheduled for the beginning of July.


Clustering of histogram data: a topological
learning approach


Tecniche di raggruppamento di dati a istogramma: un 
approccio basato sull’apprendimento topologico 


Guénaël Cabanes, Younès Bennani, Rosanna Verde and Antonio Irpino


Abstract A histogram-valued observation is described by a set of distributions. In
this paper, we propose a clustering approach using an adaptation of the
Self-Organizing Map (SOM) algorithm. The idea is to combine the dimension reduction
obtained with a SOM with the clustering of the data in this reduced space. The L2
Wasserstein distance is used to measure the dissimilarity between distributions and
to estimate local data densities in the original space. The main advantage of the
proposed algorithm is that the number of clusters is found automatically.
Applications to synthetic and real data sets demonstrate the validity of the
proposed approach.

Abstract (from the Italian) Histogram data are described by distributions
(represented in the form of histograms). In this work, we propose a clustering
approach using an adaptation of self-organizing maps (SOM). The idea is to combine
the dimension reduction obtained with a SOM with the clustering of the data in the
reduced space. The L2 Wasserstein distance is used to measure the dissimilarity
between distributions. It is also used to estimate local data densities in the
original space. The main advantage of the algorithm is that the number of groups is
determined automatically. The validity of the proposed approach is shown through its
application to artificial and real data.


Key words: Clustering, Self-Organising Map, Histogram data, Wasserstein distance,
Density measure


1 Introduction


A histogram-valued observation is described by a set of distributions, represented
by histogram variables. A histogram consists of a sequence of contiguous intervals
with an associated set of weights (e.g. the relative frequencies). Histogram data
were introduced in the context of Symbolic Data Analysis (SDA) by Bock and Diday, in
the SDA reference book [1]. As histogram data sets mostly arise from empirical
distributions, the techniques recently developed for such data refer to
distributional data. Research on distributional data analysis has also been advanced
by the use of a suitable distance for comparing distributions, the L2 Wasserstein
distance. This measure offers a different way of computing the distance between
distributional data. The L2 Wasserstein distance can be decomposed into two
components: the first related to the means (location parameter) and the second
related to the higher moments (scale and shape parameters). In this way, the results
of distributional data analysis take into account the main characteristics of the
data. The two components can also be considered separately, to analyse the influence
of the size and the shape of the data on the analysis.
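The decomposition can be checked numerically with a small Python sketch. It assumes that each distribution is summarised by its quantile values on a common, equally spaced grid of probability levels, so that the squared L2 Wasserstein distance is approximated by the mean squared difference between the two quantile vectors. The split into location, scale and shape terms follows the usual form of the decomposition. The function names and data are illustrative, and non-degenerate distributions (positive standard deviation) are assumed.

```python
import math

def wasserstein2_sq(qf, qg):
    """Squared L2 Wasserstein distance between two distributions given by
    their quantile values on a common, equally spaced grid of levels."""
    return sum((a - b) ** 2 for a, b in zip(qf, qg)) / len(qf)

def decompose(qf, qg):
    """Split the squared distance into location, scale and shape parts."""
    m = len(qf)
    mu_f, mu_g = sum(qf) / m, sum(qg) / m
    sd_f = math.sqrt(sum((a - mu_f) ** 2 for a in qf) / m)
    sd_g = math.sqrt(sum((b - mu_g) ** 2 for b in qg) / m)
    cov = sum((a - mu_f) * (b - mu_g) for a, b in zip(qf, qg)) / m
    rho = cov / (sd_f * sd_g)  # correlation between the two quantile functions
    location = (mu_f - mu_g) ** 2
    scale = (sd_f - sd_g) ** 2
    shape = 2 * sd_f * sd_g * (1 - rho)
    return location, scale, shape

qf = [1.0, 2.0, 3.0, 4.0]
qg = [2.0, 4.0, 6.0, 9.0]
parts = decompose(qf, qg)
distance_sq = wasserstein2_sq(qf, qg)
```

The first term depends only on the means, while the second and third together capture the differences in variability and shape, which correspond to the two components mentioned in the text.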

SOM for symbolic data was first proposed by Bock [1] to visualise the structure of
symbolic data in a reduced subspace. Further SOM methods for a particular kind of
symbolic data, interval data, have been developed using suitable distances for
interval data, such as the Hausdorff distance, the L2 distance and adaptive
distances [2]. For histogram data, which are another representation of symbolic data
by empirical distributions, a SOM based on the L2 Wasserstein distance has been
proposed in [3] to cluster distributions. An adaptive Wasserstein distance has also
been developed in this context to find, automatically, weights for the variables as
well as for the clusters. However, most of these methods provide a quantization and
a visualization of symbolic data (intervals, histograms) but cannot be used directly
to obtain a clustering of the data. The recent algorithm proposed in [4], S2L-SOM
learning for interval data, is a two-level clustering algorithm based on SOM that
combines the dimension reduction by SOM with the clustering of the data in the
reduced space into a certain number of homogeneous clusters. Here, we propose an
extension of this approach to histogram data. In the clustering phase, the L2
Wasserstein distance is used, following the dynamic clustering algorithm proposed in
[5]. The number of clusters is not fixed a priori as a parameter of the clustering
algorithm but is found automatically from an estimation of the local density and
connectivity of the data in the original space, as in [6].

The paper is organized as follows. Section 2 presents the proposed approach.
Section 3 describes the experimental protocol and the results obtained to validate
our approach. Finally, conclusions and future work are given in Section 4.


2 DHSOM: a topological density-based clustering for Histogram data


2.1 Principles of the approach 


A SOM consists of a set of artificial neurons that represent the data structure.
These neurons are connected with their neighbours according to topological
connections (also called neighbourhood connections). The input dataset is used to
organize the SOM under the topological constraints of the input space. Thus, a
correspondence between the input space and the mapping space is built. Two
observations that are close in the input space should activate the same neuron or
two neighbouring neurons of the SOM. Each neuron is associated with a prototype and,
to respect the topological constraints, the neighbouring neurons of the Best Match
Unit of an observation (BMU, the most representative neuron) also update their
prototypes for a better representation of this observation. The update is larger for
neurons that are closer neighbours of the best neuron.

In DS2L-SOM [6], the prototypes of a SOM are enriched with local estimates of
density and connectivity, allowing an estimation of the underlying distribution of
the data. More specifically, we compute an estimate of the local density of the
data, i.e. a measure of the data density surrounding each prototype. The local
density gives information about the amount of data present in an area of the input
space. We use a Gaussian kernel estimator [7] for this task. The connectivity
measures how close two prototypes are in terms of data representation. The
connectivity value of two prototypes is the number of observations that are well
represented by both of them (the two prototypes are the first two Best Match Units
for these observations). From these estimates, it is possible to compute a
clustering of the prototypes (as a representation of the clustering of the data), as
described in [6]. In that case, clusters are defined as regions of the
representation space with relatively high density, separated by regions of
relatively low density. As in most density-based methods, the number of clusters is
detected automatically.

To adapt the principles of DS2L-SOM to histogram data, we need a modified version
of the Self-Organising Map and an adapted enrichment of the prototypes. We choose
here the SOM algorithm for histogram data proposed in [3], where each prototype is
defined as a histogram and the distances between data and prototypes are computed
with the L2 Wasserstein distance. In addition, the estimation of the local densities
and variabilities in DS2L-SOM is mainly based on the distance between the data and
the prototypes. By using the L2 Wasserstein distance in the enrichment step, the
clustering of histogram data becomes possible. It is worth noting that the density
estimated with this metric takes into account all the information about the
characteristics of the data.


2.2 SOM for histogram data


The adaptation of SOM to histogram data is based on two principles: each prototype
is a histogram, and the distances between observations and prototypes are computed
with the L2 Wasserstein metric. In this paper we propose the use of a batch version
of SOM adapted to histograms. The first step of the algorithm is the competition
step, where each observation is assigned to the neuron with the closest prototype
(i.e. the BMU: Best Match Unit) according to the L2 Wasserstein distance. The second
step is the adaptation step, where each prototype is updated to minimise the average
distance between the prototypes and the observations, weighted by the topological
structure of the map. The function to minimize is the following:

$$\tilde{R}(w) = \frac{1}{N} \sum_{k=1}^{N} \sum_{j=1}^{M} K_{j,u^*(x^k)} \, d_W^2(w^j, x^k) \qquad (1)$$

where $x$ is an observation represented as a histogram, $w$ is a prototype, $d_W$ is
the L2 Wasserstein distance, $N$ is the number of learning samples, $M$ the number
of neurons in the map, $u^*(x^k)$ is the neuron whose weight vector is closest to
the observation $x^k$ (i.e. the best match unit, BMU), and $K_{ij}$ is a positive
symmetric kernel function: the neighbourhood function. The relative importance of a
neuron $i$ compared to a neuron $j$ is weighted by the value of the kernel function
$K_{ij}$, which can be defined as:
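One form that is commonly used for batch SOM, and that is consistent with the temperature, the initial and final temperatures, the maximum time and the grid distance described in the next paragraph, is a Gaussian-type kernel with an exponentially decreasing temperature. It is given here as an assumption and not as the authors' exact expression:

```latex
K_{ij} = \frac{1}{\lambda(t)} \exp\!\left( -\frac{d^2(i,j)}{\lambda^2(t)} \right),
\qquad
\lambda(t) = \lambda_i \left( \frac{\lambda_f}{\lambda_i} \right)^{t / t_{max}}
```

With this choice the neighbourhood is wide at the start of training (large temperature) and shrinks as $t$ approaches $t_{max}$, so that late updates affect almost only the best match unit.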
$\lambda(t)$ is the temperature function modelling the extent of the topological
neighbourhood. $\lambda_i$ and $\lambda_f$ are the initial and the final temperature
respectively, and $t_{max}$ is the maximum time allotted. $d(i, j)$ is the Manhattan
distance between two neurons $i$ and $j$ on the map grid. To minimize eq. 1, each
prototype is updated to represent the barycentre of the observations, weighted by
$K_{ij}$. The prototypes can be computed using a decomposition into the centre and
the radius in each dimension. The weighted barycentre is then expressed as follows:

The L2 Wasserstein distance between histograms is computed through their piecewise
quantile functions. Note that a linear combination of quantile functions is again a
quantile function only if the weights are positive. The complete algorithm is
described in Algorithm 1.
Algorithm 1 SOM for histogram data

Define the topology of the SOM.
Initialize the prototypes $w^j$.
repeat
    for all histogram data $x^k$ do
        Select the BMU $u^*(x^k)$ according to the L2 Wasserstein distance.
    end for
    for all prototypes $w^j$ do
        Update $w^j$ according to eq. 2 and 3.
    end for
until $t = t_{max}$
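A compact Python sketch of the competition and adaptation steps is given below. It is not the authors' implementation: each observation is reduced to a single histogram stored as a vector of quantile values, the prototype update is taken to be the kernel-weighted mean of these quantile vectors, and the kernel shape, the temperature schedule and all parameter values are assumptions made for the example.

```python
import math

def w2_sq(qf, qg):
    """Squared L2 Wasserstein distance between two quantile vectors."""
    return sum((a - b) ** 2 for a, b in zip(qf, qg)) / len(qf)

def train_som(data, grid, t_max=20, lam_i=2.0, lam_f=0.5):
    """Batch SOM on quantile vectors.

    data: list of quantile vectors (one histogram per observation).
    grid: list of (row, col) positions, one per neuron.
    lam_i, lam_f, t_max and the kernel shape are illustrative assumptions.
    """
    # Initialise the prototypes with evenly spaced observations
    step = max(1, len(data) // len(grid))
    protos = [list(data[(j * step) % len(data)]) for j in range(len(grid))]
    for t in range(t_max):
        lam = lam_i * (lam_f / lam_i) ** (t / t_max)
        # Competition: best match unit of each observation
        bmus = [min(range(len(grid)), key=lambda j: w2_sq(protos[j], x))
                for x in data]
        # Adaptation: kernel-weighted barycentre of the quantile vectors
        for j, pos in enumerate(grid):
            num = [0.0] * len(data[0])
            den = 0.0
            for x, b in zip(data, bmus):
                d = abs(pos[0] - grid[b][0]) + abs(pos[1] - grid[b][1])
                k = math.exp(-d * d / (lam * lam)) / lam
                den += k
                num = [n + k * v for n, v in zip(num, x)]
            protos[j] = [n / den for n in num]
    return protos

# Two well-separated groups of hypothetical quantile vectors, two neurons
data = [[0.0, 1.0, 2.0], [0.2, 1.2, 2.2], [0.1, 1.1, 2.1],
        [10.0, 11.0, 12.0], [10.2, 11.2, 12.2], [10.1, 11.1, 12.1]]
grid = [(0, 0), (0, 1)]
protos = train_som(data, grid)
```

Since all kernel weights are positive, each updated prototype is a positive combination of non-decreasing quantile vectors and is therefore itself a valid quantile vector, in line with the remark on positive weights above.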


2.3 Prototypes Enrichment 


Once the prototypes are computed, the model can be enriched with additional
information associated with each prototype, in order to improve the representation
of the underlying structure of the data. Two pieces of information are computed in
this step (Algorithm 2). The connectivity between neurons is a measure of
discontinuity in the topological space and makes it possible to detect clusters
separated by an empty region of the representation space. As such regions are often
not well represented by the prototypes, this ensures the detection of cluster
borders between two adjacent neurons. However, when the boundary between clusters is
defined by a region of lower density between two regions of higher density, the
connectivity is not sufficient and an estimation of local densities is necessary.
The local density $D_i$ associated with each prototype $w^i$ is estimated as
follows:

$$D_i = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{d_W(w^i, x^k)^2}{2\sigma^2} \right) \qquad (4)$$

with $\sigma$ a bandwidth parameter chosen by the user and $d_W(w^i, x^k)$ the
distances between the $M$ prototypes $w^i$ and the $N$ histogram data $x^k$,
computed with the L2 Wasserstein metric.
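The density estimate can be sketched in Python as follows, assuming the Gaussian kernel estimator in its usual normalised form and a precomputed matrix of distances between prototypes and observations. The function name and the toy distances are illustrative.

```python
import math

def local_densities(dist, sigma):
    """Gaussian-kernel density around each prototype.

    dist[i][k]: distance between prototype i and observation k.
    sigma: bandwidth chosen by the user.
    """
    norm = sigma * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-d * d / (2 * sigma ** 2)) for d in row) / (norm * len(row))
        for row in dist
    ]

# Prototype 0 coincides with both observations, prototype 1 is farther away
dist = [[0.0, 0.0], [1.0, 3.0]]
dens = local_densities(dist, sigma=1.0)
```

A small bandwidth makes the density sensitive only to observations very close to the prototype, while a large one smooths over neighbouring regions, so the choice of `sigma` directly affects how many density maxima are later detected.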


Algorithm 2 Enrichment of prototypes

Input: the Wasserstein distance between each observation and each prototype.
Output: a density value $D_i$ for each neuron $w^i$ and a connectivity value
$v_{i,j}$ for each pair of neurons $i$ and $j$.

for all neurons $i$ do
    Compute the local density $D_i$ using eq. 4.
end for
for all data $x^k$ do
    Find the two closest prototypes (BMUs) $u^*(x^k)$ and $u^{**}(x^k)$ using:
        $u^*(x^k) = \arg\min_i d_W(w^i, x^k)$
        $u^{**}(x^k) = \arg\min_{i \neq u^*(x^k)} d_W(w^i, x^k)$
    Compute $v_{i,j}$ = the number of data having $i$ and $j$ as their first two BMUs.
end for
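The connectivity count of Algorithm 2 can be sketched in Python as follows. The distance matrix and the function name are hypothetical, and the pair of best match units is stored with the smaller index first so that the count is symmetric in the two prototypes.

```python
from collections import Counter

def connectivity(dist):
    """Count, for each pair of prototypes, the observations whose two best
    match units are that pair.

    dist[i][k]: distance between prototype i and observation k.
    Returns a Counter keyed by (i, j) with i < j.
    """
    n_protos, n_obs = len(dist), len(dist[0])
    v = Counter()
    for k in range(n_obs):
        ranked = sorted(range(n_protos), key=lambda i: dist[i][k])
        first, second = ranked[0], ranked[1]
        v[tuple(sorted((first, second)))] += 1
    return v

dist = [
    [0.1, 0.2, 5.0],   # prototype 0
    [0.5, 0.4, 4.0],   # prototype 1
    [6.0, 7.0, 0.3],   # prototype 2
]
v = connectivity(dist)
```

Pairs that never share an observation keep a count of zero, which is what later separates prototypes into distinct connected sets.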
2.4 Clustering of prototypes


Various prototype-based approaches have been proposed to solve the clustering
problem [8, 9, 10]. However, the clustering obtained is never optimal, since part of
the information contained in the data is not represented by the prototypes. Here we
use the density and the connectivity to optimize the clustering (see Algorithm 3).


Algorithm 3 Clustering of enriched prototypes 


Input: the density values $D_i$ and the connectivity values $v_{i,j}$.
Output: The clusters of prototypes.

 1: Extract the sets of connected neurons $P = \{C_k\}_{k=1}^{L}$, such that:
    $\forall m \in C_k, \exists n \in C_k$ such that $v_{m,n} >$ threshold
 2: In this paper threshold = 0.
 3: for all $C_k \in P$ do
 4:   Find the set $M(C_k)$ of density maxima.
      $M(C_k) = \{w^i \in C_k \mid D_i > D_j, \forall w^j \text{ neighbour to } w^i\}$
      Prototypes $w^i$ and $w^j$ are neighbours if $v_{i,j} >$ threshold.
 5:   Determine the merging threshold matrix:
      $S = [S(i,j)]_{i,j=1 \ldots |M(C_k)|}$ with $S(i,j) = \left(\frac{1}{D_i} + \frac{1}{D_j}\right)^{-1}$
 6:   for all prototypes $w^i \in C_k$ do
 7:     Label $w^i$ with one element label(i) of $M(C_k)$, according to an ascending density gradient
        along the neighbourhood. Each label represents a micro-cluster.
 8:   end for
 9:   for all pairs of neighbouring prototypes $(w^i, w^j)$ in $C_k$ do
10:     merge the two micro-clusters if:
        $\text{label}(i) \neq \text{label}(j)$, $D_i > S(\text{label}(i), \text{label}(j))$ and $D_j > S(\text{label}(i), \text{label}(j))$
11:   end for
12: end for
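
A compact sketch of this procedure is given below. It is an illustration under stated assumptions, not the authors' implementation: the merging threshold is taken to be S(a, b) = (1/D_a + 1/D_b)^(-1), all densities are assumed strictly positive, a prototype with no denser neighbour is treated as a density maximum, and the function name `cluster_prototypes` is hypothetical. The explicit extraction of connected sets (step 1) is implicit here, because labels and merges only propagate between neighbours.

```python
def cluster_prototypes(density, connectivity, threshold=0):
    """Return one cluster id per prototype from densities and connectivities."""
    n = len(density)
    neighbours = [
        [j for j in range(n) if j != i and connectivity[i][j] > threshold]
        for i in range(n)
    ]

    def climb(i):
        # Follow the ascending density gradient up to a local maximum.
        while True:
            best = max(neighbours[i] + [i], key=lambda j: density[j])
            if density[best] <= density[i]:
                return i
            i = best

    # Each micro-cluster is identified by the index of its density maximum.
    label = [climb(i) for i in range(n)]

    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in neighbours[i]:
            a, b = label[i], label[j]
            if i < j and a != b:
                # Assumed merging threshold between the two maxima.
                s = 1.0 / (1.0 / density[a] + 1.0 / density[b])
                if density[i] > s and density[j] > s:
                    parent[find(a)] = find(b)
    return [find(a) for a in label]


# Toy example: a chain 0-1-2-3 and a separate pair 4-5 (made-up values).
toy_conn = [[0] * 6 for _ in range(6)]
for i, j in [(0, 1), (1, 2), (2, 3), (4, 5)]:
    toy_conn[i][j] = toy_conn[j][i] = 1
shallow = cluster_prototypes([1.0, 5.0, 4.0, 6.0, 2.0, 3.0], toy_conn)
deep = cluster_prototypes([1.0, 5.0, 0.5, 6.0, 2.0, 3.0], toy_conn)
```

With a shallow density valley at prototype 2 the two micro-clusters of the chain are merged, while a deep valley keeps them apart; the pair 4-5 always forms its own cluster because it is not connected to the chain.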


The main idea is that the core part of a cluster can be defined as a region of high
density. In most cases, the cluster borders are then defined either by a low-density
region or by an “empty” region between clusters (i.e. large inter-cluster distances) [11].
At the end of the enrichment process, each set of prototypes linked together by a
connectivity value $v > 0$ defines a well-separated (i.e. distance-defined) cluster. The
estimation of the local density ($D$) is then used to detect cluster borders defined
by low density. Each cluster is defined by a local maximum of density. Thus, a
“Watersheds” method [12] is applied to the prototypes’ density within each well-separated
cluster to find low-density areas inside these clusters. Finally, for each pair of adjacent
subgroups we use a density-dependent index [13] to check whether a low-density area is a
reliable indicator of the data structure.
3 Experimental results


We generated six datasets with different numbers of clusters and dimensions. Each
dataset contains 1000 observations; each observation consists of 2 or 10 histograms
(i.e. 2 or 10 dimensions). For each histogram, 1000 values were generated
using a Gamma distribution with 3 parameters: the mean value, the standard deviation
and a shape parameter controlling the skewness of the distribution. From these
values, an equi-depth histogram is computed, using the deciles of the values as
interval bounds, for a total of 10 intervals per histogram, such that each interval
has a weight $\pi = 0.1$. The observations are generated
in 3 or 5 clusters, respectively, according to different parameters of the Gamma
distribution. In our proposal, the Wasserstein distance takes into account the different
components of the distribution (mean, standard deviation and shape). We expect
the results to depend strongly on the distance. To validate our method (“Prop”),
we compare the results with different strategies based on different dissimilarities.
We tested measures using only the component $c_i$ (“Center”) or $r_i$ (“Radius”) of the
Wasserstein distance decomposition. We also tested a distance based on the means
$\mu$ of the distributions (“Mean”), and a distance based on the standard deviations
$\sigma$ of the distributions (“Std”). Finally, we tested two distances between interval data
computed from the support values of the histograms. In the first case (“Int1”), we used
the lower and upper values of the distribution support to define the interval
bounds: [min, max]. In the second case (“Int2”), we used the mean and standard
deviation of the distribution to compute the interval bounds: $[\mu - \sigma, \mu + \sigma]$.
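
The construction of the histograms and the distance between them can be sketched as follows. This is a minimal illustration under explicit assumptions: values are taken to be uniformly spread inside each interval, so that the squared $L_2$ Wasserstein distance reduces to a weighted sum over intervals of squared differences of centres plus one third of squared differences of radii; a two-parameter Gamma sample with a shift stands in for the three-parameter generator used in the experiments; the function names are hypothetical.

```python
import math
import random


def equi_depth_histogram(values, n_bins=10):
    """Return the n_bins + 1 interval bounds, each interval holding 1/n_bins of the values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    bounds = []
    for b in range(n_bins + 1):
        # Linear interpolation of the empirical quantile of order b / n_bins.
        pos = last * b / n_bins
        lo = int(math.floor(pos))
        hi = min(lo + 1, last)
        bounds.append(ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo]))
    return bounds


def wasserstein_l2(bounds_a, bounds_b):
    """L2 Wasserstein distance between two equi-depth histograms with equal bin counts."""
    n_bins = len(bounds_a) - 1
    weight = 1.0 / n_bins
    total = 0.0
    for l in range(n_bins):
        c_a = (bounds_a[l] + bounds_a[l + 1]) / 2.0  # centre of interval l
        c_b = (bounds_b[l] + bounds_b[l + 1]) / 2.0
        r_a = (bounds_a[l + 1] - bounds_a[l]) / 2.0  # radius of interval l
        r_b = (bounds_b[l + 1] - bounds_b[l]) / 2.0
        total += weight * ((c_a - c_b) ** 2 + (r_a - r_b) ** 2 / 3.0)
    return math.sqrt(total)


rng = random.Random(0)
sample_a = [rng.gammavariate(2.0, 1.5) for _ in range(1000)]
sample_b = [x + 3.0 for x in sample_a]  # same shape, location shifted by 3
hist_a = equi_depth_histogram(sample_a)
hist_b = equi_depth_histogram(sample_b)
shift_distance = wasserstein_l2(hist_a, hist_b)
```

Because the second sample is a pure shift of the first, only the centre terms contribute and the distance equals the size of the shift, up to rounding. Differences in spread or skewness would show up in the radius terms as well.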

The results are shown in Table 1. The performance of the different approaches
is evaluated using the adjusted Rand index. This index equals 1 for a perfect
match with the expected clustering and is close to 0 for a random solution; it can
take negative values when the agreement is worse than expected by chance.
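
For reference, the adjusted Rand index can be computed from the contingency table of the two partitions. The sketch below uses the standard pair-counting formula with the Python standard library only; the function name `adjusted_rand_index` is an illustrative choice and at least two observations are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index between two partitions."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2.0
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two identical partitions with swapped label names and gives 1; the second compares two partitions that disagree on every pair and gives a negative value.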


Table 1: Adjusted Rand Index for each dataset and each approach. Param is the
parameter defining the differences between the clusters, k is the number of clusters,
d is the number of dimensions (i.e. the number of histograms per observation)


Param | k | d  | Prop | Center | Radius | Mean | Std  | Int1 | Int2
Mean  | 3 | 2  | 1.00 | 0.88   | 0.00   | 1.00 | 0.00 | 0.00 | 1.00
Mean  | 5 | 10 | 1.00 | 1.00   | 0.00   | 1.00 | 0.00 | 0.43 | 1.00
Shape | 3 | 2  | 0.95 | 0.57   | 0.00   | 0.00 | 0.00 | 0.52 | 0.00
Shape | 5 | 10 | 0.81 | 0.60   | 0.00   | 0.00 | 0.00 | 0.19 | 0.00
Std   | 3 | 2  | 1.00 | 1.00   | 0.00   | 0.00 | 1.00 | 0.98 | 1.00
Std   | 5 | 10 | 1.00 | 1.00   | 0.99   | 0.00 | 1.00 | 1.00 | 1.00


From the results we can see that the proposed method (“Prop”) with the $L_2$ Wasserstein
distance detects the cluster separations (and therefore the number of clusters) for the
6 datasets, with an adjusted Rand index between 0.81 and 1.00. Our approach is the
only one that correctly detects differences in the shape of the distributions, in
addition to detecting clusters with different means or standard deviations.
4 Conclusion


In this paper we proposed a two-level clustering method for histogram data. The
approach takes into account all the information about the size and shape of the
distributional data thanks to the $L_2$ Wasserstein metric. The method is fast
and does not require the number of clusters to be fixed by the user: the cluster
boundaries are detected automatically, based on an estimation of local densities
and connectivities in the partitioning process. The core part of a cluster being
defined as the region of higher density, the Wasserstein distances between histogram
data make it possible to detect areas of low density between clusters.

The specificity of the presented strategy is that the density differs from the
density of classical data: the Wasserstein distance makes it possible to compute the
data density according to the characteristics of the data distributions, resulting in a
richer model of the data structure. We have shown that the proposed method gives
better results than competing strategies.
Measuring Wellbeing by extracting Social Indicators from Big Data
Abstract Traditionally, the construction of social indicators is based upon the
availability of data collected on purpose (e.g., official statistics). It is a common view
that constructing social indicators can benefit from the availability of new sources
of data, that is, big data. One of the big challenges in dealing with new data sources
is the possibility of describing complex social phenomena from different
perspectives in order to enrich existing indicators and/or build new ones. However,
this possibility introduces new issues in constructing indicators. Our study
explores how the classical methodology for the construction of social indicators
should be reconsidered when using data collected for other aims. In this
perspective, the individual sales receipts collected during the period 2007/13 and made
available to our group by a big Italian chain of stores allow us to explore not only
a particular social phenomenon but also the methodological implications of
dealing with big data. In particular, we try to (i) understand what kind of information
that is important and informative for constructing social indicators can be extracted
from these data; (ii) study the behaviour of different families in a crucial period, by
detecting possible changes in people’s lifestyle and the possible role of the crisis of
recent years in these changes; (iii) elaborate and test a model aimed at extracting social
indicators from big data. The study is enriched by the possibility of observing group
behaviour across different areas (by referring to the territorial distribution of
the stores) and of tracing individual spending behaviour over time, while ensuring
the anonymity of any sensitive information present in the data. The initial
stage of our study, presented here, shows some results obtained by analysing the data
with data mining clustering techniques, in order to identify some typical purchasing
behaviours and to test whether and how other structural information can be
estimated from this information.
1 Introduction


The aim of this project is to define new social indicators describing customer
behaviour, starting from the analysis of big data on purchases in the stores of
a supermarket chain. Traditionally, the construction of social indicators is based
on data collected for that purpose, as in the case of official statistics. In recent
years, however, the availability of new data sources such as big data has created a
growing need to build new social indicators from this information. One of the great
challenges in dealing with these new data sources is the possibility of describing
complex social phenomena from different points of view. Figure 1 shows the phases of
the project, which aims to discover what information can be extracted from these data
and used in the construction of social indicators. By analyzing the behavior
of different families in a crucial period, we can observe possible changes in
people's lifestyles and the role played by the crisis of recent years, and we can develop
and test a model for defining social indicators from big data. The study
of the social behaviors and lifestyles of families, the subject of several articles in the
journal Social Indicators Research, published by Springer, can help in defining new
social indicators, as discussed in [2] and [3].

Fig. 1 Flow of the entire iterative project.

The dataset labeled BigData at the top left of Figure 1 identifies the data at the
start of the process; the second BigData dataset represents the data after the selection
phase, that is, the data suitable for the purpose of the project. The output of the
analysis phase is the third dataset, labeled profiles, which corresponds to the groups
of customers resulting from the classification carried out in the analysis phase with
data mining techniques. The dashed arrow going from the definition of the indicators
to the first dataset indicates that the process is iterative: when the process has led
to a good definition of the indicators, the arrow can represent their application
to another dataset; when a first iteration does not produce any good indicators,
the process begins again from the selection of useful information. The project
therefore aims to explore how the classical methodological approach to the definition
of social indicators may be reconsidered in light of the use of data collected for other
purposes, for example administrative ones. Starting from transactional data, such as
those analyzed in [5] and [6], or from the receipts of customer purchases collected
during the period 2007 to 2013, the project aims to explore not only a particular
social phenomenon, but also the methodological implications of dealing with big data.
Through the analysis of this information, particular categories of consumers can be
identified, classified also according to how customers changed their purchasing
behavior during the economic crisis that has affected Italy in recent years. This
classification, together with the analysis of purchasing changes, can make it possible
to estimate particular social and/or economic hardships. The analysis can also examine
the behavior of particular groups of customers in relation to geography, that
is, by referring to the territorial distribution of the stores, trace individual
spending behavior over time, and verify whether and how the results can be used
to estimate other structural information (e.g., the size and structure of the family).
We analyze the data of our case study, concerning purchases in a single store of the
chain selected for its appropriate features (for example, not being affected by
seasonality), using the software R [8] for traditional descriptive investigations and
the software KNIME [9] for data mining analysis.


2 Understanding the customers behaviours 


2.1 Defining Social Indicators 


The idea is to define indicators starting from the analysis of data stored for other
purposes, which can be more timely than official indicators; the purpose is to monitor
important signals, resulting from changes in customers' purchasing behavior, that can
help to predict changes in the macroeconomic framework. The methodology that we
propose starts by grouping customers with clustering techniques, with the aim of
identifying the typical characteristics of each cluster, defined by how much, what, and
how its customers have purchased. At the macroscopic level, these three dimensions of
analysis translate, respectively, into the total expenditure, the total quantity of products
purchased and the number of shopping trips. The
choice of the temporal component, that is, the unit of the observed period, is very
important in this first phase of the project; in this analysis it is the year.
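
As an illustration of these three dimensions, the sketch below aggregates individual receipts into yearly totals per customer. The receipt layout (customer, year, amount, quantity) and the function name `yearly_indicators` are assumptions made for the example; the actual structure of the receipts is not described here.

```python
from collections import defaultdict


def yearly_indicators(receipts):
    """receipts: iterable of (customer_id, year, amount, quantity), one per receipt.

    Returns {(customer_id, year): {"amount": ..., "quantity": ..., "trips": ...}}.
    """
    totals = defaultdict(lambda: {"amount": 0.0, "quantity": 0, "trips": 0})
    for customer_id, year, amount, quantity in receipts:
        entry = totals[(customer_id, year)]
        entry["amount"] += amount      # how much: total expenditure
        entry["quantity"] += quantity  # what: total quantity of products
        entry["trips"] += 1            # how: number of shopping trips
    return dict(totals)


# Made-up receipts for two customers.
toy_receipts = [
    (10, 2007, 25.50, 8),
    (10, 2007, 40.00, 12),
    (10, 2008, 30.25, 9),
    (11, 2007, 12.00, 3),
]
toy_totals = yearly_indicators(toy_receipts)
```

Each customer is then described by one yearly series per attribute, which is the input expected by the clustering step.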


2.2 Clustering and classification: customers profiles 


We explore, through the classification of the categories to which products belong,
whether the types of products purchased change when the values of the parameters
listed in the previous section change. For example, by examining the categories of
products purchased in more detail, we may find that, during the crisis, a customer
segment decreased its purchases of niche products in favor of basic/low-level
products. The study analyzes customer data on amounts, quantities and
number of shopping trips (the “number of expenses”), aggregated by year. Starting from
amount and quantity, we can extend the analysis to the level of the category. First,
the three attributes are analyzed at the yearly level one at a time, so that, for example,
starting from the information on the total amount in each year, we can associate with
each customer a sequence of seven values, namely the amounts in the seven years
involved. We apply the K-means clustering algorithm to data organized as shown
in Table 1, in order to identify K clusters containing customers who behaved
similarly over the years with respect to yearly amounts. A customer can be
represented as a series of n points, one for each year; we can thus represent the behavior
of a customer with a broken line describing the trend of the customer's total annual
expenditure. Using this graphical metaphor, proposed by the authors in [1], the analysis
reveals customer groups that follow the same pattern over several years. The analysis
is then repeated on the other information, that is, on the quantity and the
number of shopping trips, as well as on the ratio between amount and number of
shopping trips. The results of these different analyses can be interpreted and investigated
together to understand, for example, the relationship between the customers
assigned to a particular cluster with respect to the amounts and those
assigned to the clusters obtained from the
analysis of quantity. A peculiarity of the K-means algorithm is
the choice of the number K of clusters. We chose this value by computing the
SSE (Sum of Squared Errors), as reported in [7], for several values of K
and locating, on the resulting curve, which decreases as K increases,
the point of maximum bending,
which is known to give good results.
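
The choice of K can be sketched as follows. The example runs a plain Lloyd-style K-means on yearly trajectories, records the SSE for several values of K and picks the K at which the curve bends most. Reading “maximum bending” as the largest second difference of the SSE curve is an assumption of this sketch, as are the function names and the toy trajectories.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, n_iter=50, seed=0):
    """Run a plain K-means and return the final sum of squared errors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow(sse_by_k):
    """Return the K with the largest second difference of the SSE curve."""
    ks = sorted(sse_by_k)
    bend = {
        ks[i]: sse_by_k[ks[i - 1]] - 2 * sse_by_k[ks[i]] + sse_by_k[ks[i + 1]]
        for i in range(1, len(ks) - 1)
    }
    return max(bend, key=bend.get)


# Made-up yearly amounts for six customers over three years.
toy_customers = [
    [300.0, 310.0, 305.0], [295.0, 300.0, 310.0],        # low, constant
    [300.0, 600.0, 900.0], [320.0, 640.0, 880.0],        # growing
    [2000.0, 1500.0, 1000.0], [2100.0, 1400.0, 900.0],   # decreasing
]
sse_curve = {k: kmeans_sse(toy_customers, k) for k in range(1, 6)}
best_k = elbow(sse_curve)
```

Since K-means depends on its random initialisation, in practice the SSE for each K would be taken over several restarts before reading the elbow from the curve.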


Table 1 An example of yearly information concerning customer 10. 


Customer_Id | Year0  | Year1   | ... | Year(n-1) | Type
10          | 5      |         | ... | 8         | number of exp.
10          | 100    | 120     | ... | 250       | quantity
10          | 300.75 | 600.604 | ... | 1050.10   | amount


2.3 Clustering and classification: products profiles 


The analysis presented in the previous section can reveal customer groups
that suggest further investigation, with the aim of finding which products
have led to a change in the purchasing behavior of customers. The focus of the analysis
thus moves from the customer to the product: it is important to understand whether,
and for which types of product, purchasing behaviors changed over the years affected
by the economic crisis. In particular, we apply clustering techniques to group
customers and understand whether there are changes in the shopping cart. We would like to
know what happens to the types of products purchased when
amounts, quantities or the number of shopping trips change significantly. During a crisis,
a group of customers may reduce its purchases of niche products in
favor of lower-end products. The goal is to find the products that can
be considered sentry products: keeping these products under control can help us
to identify important signs of change in people's lifestyle. We start from aggregated
data, selecting the products at the 75th percentile, that is, products that are
bought by at least 75% of customers; for each year and for each category we
have the purchased quantity, as shown in Table 2.


Table 2 An example of product quantities aggregated on the year. 


Category    | Year0 | Year1 | ... | Year(n-1)
bread       | 5460  | 6745  | ... | 18271
dried fruit | 2900  | 3036  | ... | 4194
potatoes    | 5971  | 5910  | ... | 5553


The analysis begins with a clustering step on the product data for
the first year, illustrated by the K-Means node in the figure; the model resulting
from this first step is then used to group the data of the other years (illustrated by the
Clustering Assigner nodes in the figure). At the end of the process, each product is
associated with the list of clusters that it passed through over the years. Table 3
illustrates this result, where the first clustering step was performed by finding k clusters.
The value of k was chosen by considering the trend of the SSE value. Our methodology
adopts a convention for naming the resulting clusters: we assign them numerical
labels so that the cluster corresponding to the lowest values of the amount of product
purchased has the number 0 and the one corresponding to the highest
values has the number n; the remaining cluster labels lie in the interval between
these minimum and maximum values.

The ultimate aim is to know whether any products reveal interesting
behavior hidden in the purchases that customers made. Numbering the
clusters from those containing products purchased in
smaller quantities to those purchased in greater quantities helps us to
identify the groups of products that, over the years under analysis, have remained
constant, have been purchased gradually more or, on the contrary, have been
purchased less.
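
The labelling convention and the assignment of later years to the first-year model can be sketched as follows. The example assumes one-dimensional clustering on the yearly quantity of each category; the centroid values are made up, the quantities are the first, second and last columns of Table 2, and the function names are hypothetical.

```python
def order_centroids(centroids):
    """Sort one-dimensional centroids so that label 0 is the lowest quantity."""
    return sorted(centroids)


def assign(value, ordered_centroids):
    """Label of the nearest centroid, as a cluster-assigner step would return."""
    return min(range(len(ordered_centroids)),
               key=lambda c: abs(value - ordered_centroids[c]))


def trajectories(quantities_by_category, ordered_centroids):
    """Map each category to the list of cluster labels it passes through."""
    return {
        category: [assign(q, ordered_centroids) for q in series]
        for category, series in quantities_by_category.items()
    }


# Made-up centroids from a first-year K-means with K = 4.
first_year_centroids = [6000.0, 500.0, 3000.0, 15000.0]
ordered = order_centroids(first_year_centroids)
toy_quantities = {
    "bread": [5460, 6745, 18271],
    "dried fruit": [2900, 3036, 4194],
    "potatoes": [5971, 5910, 5553],
}
toy_paths = trajectories(toy_quantities, ordered)
```

With ordered labels, a category whose label rises over the years (here bread) has moved to a cluster of larger quantities, while a constant label indicates a stable purchasing level.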
Table 3 An example of product trends over the years, for the products purchased by
customers.


Category    | Year0        | Year1        | ... | Year(n-1)
bread       | cluster(k-2) | cluster(k-2) | ... | cluster(k-1)
dried fruit | cluster(k-1) | cluster(k-1) | ... | cluster(k-1)
potatoes    | cluster(k-1) | cluster(k-1) | ... | cluster(k-1)


3 Case study 


We observed the purchases of about 13000 customers during 2007-2013, analyzing
several attributes describing the way in which they shopped in one store of
a big supermarket chain. The dataset used in the first stage of the analysis
has 39192 rows, corresponding to 13064 customers (three rows per customer, one for
each attribute, as in Table 1) over the seven years from 2007 to 2013. For each
customer and for each year, we have information
about the number of shopping trips, the quantity of products purchased and the amounts spent.
A preliminary analysis of the data showed that the information useful for this first
study of purchasing behaviors is that relating to customers whose total yearly
spending is under 10000 euros in every year. From the initial dataset we
removed 1048 customers, obtaining a final dataset of 10095 customers.

Fig. 2 Trajectories of the cluster centroids on annual data, by amount, with K = 6.


In accordance with the previous sections, we performed three K-means clustering
analyses, on the amounts spent, the quantities of products purchased
and the number of shopping trips, respectively; Figure 2 shows the results obtained from
the analysis of the amounts spent, with K = 6. We observed and investigated some
interesting behaviour groups with respect to the annual amounts: LC, Low Constant (yellow
line, 2485 customers), the group of customers who spent low, constant amounts over
the years; LG, Low Growing (light blue line, 1580 customers), the group of customers
who spent low, increasing amounts; and finally MG, Medium Growing (green line,
1527 customers), the group of customers who spent medium, increasing amounts.

As described in Section 2.3, for each of the groups of customers selected
from the result of the previous phase, we chose the products purchased by
most customers, that is, those at the 75th percentile. With this choice, for each of
the three behaviour groups we are dealing with about 100 product categories;
it is important to note that we are using a merchandise classification
that reaches down to the level of the product category and, in the food sector,
comprises 400 product categories. Analyzing the trend of the product categories purchased
by LC customers, we found that the quantities of many products remain constant over
the period, but we observe a particular behaviour of some sentry products: elaborate
red meat and slice salumi takeaway decrease, while internal production bread
increases. The trend of the product categories purchased by LG customers highlights
the same sentry products, with some differences: internal production
bread increases, slice salumi takeaway decreases and elaborate red meat decreases, but
in this case only slightly. For the MG group of customers, the trend
shows a different behaviour: elaborate red meat decreases,
internal production bread increases, slice salumi takeaway remains constant
and savory snacks decrease. We used the colors green, red and yellow, visible only
in the electronic version of this paper, to relate the results to what can be
observed in the corresponding color maps. Color maps can
be produced by considering only the values of the selected purchased products for each
customer group under analysis.

We validate the results of the analysis of customer
profiles by considering the entropy of price. In particular, for each customer, we
calculated the variation in the price of the goods in the basket. The entropy of price,
which takes values between 0 and 1, indicates whether an individual has a stable
(low entropy) or variable (high entropy) expenditure. For the LC group, we measured
a higher but constant variability, which may indicate a continuous search
for offers. For the LG and MG groups, we found an increase in annual
expenditure together with an almost constant entropy index, which indicates less
attention to the price of the products purchased. We also calculated the entropy of price
for the customer groups corresponding to the red line in Figure 2 and to the
decreasing brown and violet lines; for the red group we observed a minor change in
price, indicating customer loyalty to the chain. The brown and violet groups,
made up of customers who spend less annually because of the crisis, show an increase in
entropy, signifying a change in the basket in terms of price; most probably
these customers pay more attention to price than to the quality of the products purchased.

We compared our results with some official statistics, which agree
with them; in particular, as in [10, 11, 12], we analyzed the trends of
some official indicators and indexes for the city and the region to which
the data of our case study refer. The period covered by our
analysis, which is monitored at only two time points (the census surveys of 2010
and 2011), shows no big changes in the indicators considered, except for a slight
increase in the employment rate; moreover, no information
about these indicators is available for the period between the two censuses, when there
could be important fluctuations.
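
The entropy of price used in the validation step is not defined in detail above. One possible reading, given here purely as an assumption, is a Shannon entropy of the distribution of a customer's purchases over a fixed number of price classes, divided by its maximum so that it lies between 0 and 1. The function name `price_entropy` and the price classes are hypothetical.

```python
import math
from collections import Counter


def price_entropy(price_classes, n_classes):
    """Normalised Shannon entropy of purchases over n_classes price classes.

    price_classes: one class index per purchased item (assumed coding).
    Returns 0 when all purchases fall in one class and 1 when they are
    spread evenly over all n_classes classes.
    """
    counts = Counter(price_classes)
    n = len(price_classes)
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return entropy / math.log(n_classes)


stable_basket = price_entropy([2, 2, 2, 2], n_classes=4)
variable_basket = price_entropy([0, 1, 2, 3], n_classes=4)
```

Under this reading, a customer who always buys in the same price class has entropy 0 and a customer who spreads purchases evenly over all classes has entropy 1, which matches the stable versus variable interpretation given in the text.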


4 Conclusions and future works 


This study should be seen as one phase in the definition of indicators that can
measure wellbeing. We explored how customers change their buying patterns and
found important signals highlighting a crisis that is also reflected in the
purchasing of essential goods. In some cases customers opted to buy cheaper products;
in others, they reduced their purchases of certain products in
favor of others. We are interested in understanding why customers'
behaviours change: it may be because the network of shops changes, or because people
in general start to eat less of a certain food, for example meat.
Assessing Selectivity in the Estimation of the
Causal Effects of Retirement on the Labour 
Division in the Italian Couples 


Selettività nella Stima degli Effetti Causali del
Pensionamento sulla Divisione del Lavoro nelle Coppie 
Italiane 
Abstract Analysing the data of the Istat survey on the Use of Time in Italy, we find that
the latent bargaining process between partners biases, through selectivity, the
estimate of the effect of the man’s retirement on the housework time of the woman
in the family. We adopt a matching procedure as an estimator in order to estimate the
causal effects of retirement on the division of labour between partners while controlling
for the selectivity of the bargaining process. The results of a sensitivity analysis confirm
the robustness of our estimates.
Introduction


The retirement of the male partner in older Italian couples does not seem to lead to a
more equitable distribution of housework between partners [7,2,3]. One
possible explanation is that Italian married men have strong bargaining power and
leave most of the housework to their wives, even after retirement. However, since
the bargaining between partners depends strongly on their latent cultural and
psychological characteristics, the influence of the bargaining process on the
relationship between retirement and the partners’ division of housework is generally
misspecified. Misspecification of bargaining involves a "reverse-causality" effect:
the latent endogenous influence of bargaining leads to an overstatement of the
effect of the man’s retirement on the time devoted to housework by a woman with
higher bargaining power and, conversely, to an understatement of the effect of
the man’s retirement on the housework time of a woman with lower bargaining
power. In this study, we try to solve this problem by taking into account the extent to
which the endogenous component of the bargaining process influences the causal
effects of retirement on the housework time of both partners, and by correcting the
estimation results accordingly.

To estimate the causal effects of retirement, we apply a propensity-score matching
procedure to compare the housework time of couples in which the male partner is
retired with that of couples in which the male partner is not
retired. In order to control the estimated propensity to retire for the latent influence of
bargaining, we estimate a Bivariate Probit (Biprobit) regression model [6], in which
the two binary response variables are, respectively, the male partner's decision to
retire (Retirement Equation) and the woman's satisfaction (satisfied or not)
with the division of labour within the couple (Satisfaction
Equation). The woman’s satisfaction with the division of labour is taken here as a proxy
for her bargaining power. A Maximum Likelihood estimator allows us to
correct the Biprobit estimates for the correlation between the error terms
of the two equations, as in a Seemingly Unrelated Regression (SUR) model;
this correlation is taken as a measure of the endogenous influence of the woman's
bargaining power on the man’s decision to retire.

Once the estimated propensity scores have been corrected
for the cross-correlation in the error terms, the differences in latent characteristics
between individuals who decide to retire (treated) and individuals who choose not to
retire (untreated) can be considered irrelevant for the propensity score estimation
and for the results of matching (the "ignorability" condition). To verify whether the
ignorability condition holds, we perform a sensitivity analysis that assesses the
extent to which both the estimated propensity scores and the estimated causal effects
change when a simulated “confounding” covariate is introduced into the set of
Biprobit regressors.

In the next section we explain how the matching procedure adopted here can
be used as an estimator, and we discuss the properties of the estimated parameters in
evaluating the causal effects of retirement on the partners’ working activity. In the third
section we discuss the estimation results and show that a marked reduction of the
woman’s domestic activity as a causal effect of the male partner's retirement is
found mainly in couples where the woman is generally satisfied with
the division of housework in the family. Finally, the results of the sensitivity analysis show
that controlling the matching estimates for heterogeneity due to the latent bargaining
process leads to robust findings [8].


Data and Methods 


We compare the observed housework time of a woman whose partner is retired, $y_{1i}$,
with the housework time of a woman with the same characteristics but observed in the
counterfactual condition (the male partner is not retired), $y_{0i}$. We denote a
woman who experienced the partner’s retirement as “treated”, and a woman who did not
experience this event as “untreated”. The parameters considered here to evaluate the causal
effects of the man’s retirement on the woman’s domestic work are the Average Treatment Effect
(ATE), the Average Treatment Effect on the Treated (ATT), and the Average Treatment Effect on
the Untreated (ATU). We assume that the observed variables $Z_i$ influencing the propensity to
retire are the same for treated and untreated, while the decision to retire is indicated by the
binary dummy $R_i \in \{0, 1\}$, with $R_i = 1$ if the male partner is retired. Adopting the
simple matching estimator of the ATE based on the propensity score, we compute


$$\widehat{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_{1i} - y_{0i} \right] \qquad (1)$$


Estimator (1) can easily be modified by conditioning the differences $y_{1i} - y_{0i}$
on $R_i = 1$ (ATT) and on $R_i = 0$ (ATU), respectively.
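
A minimal sketch of how ATE, ATT and ATU can be obtained from matched differences is given below. It assumes one-to-one nearest-neighbour matching with replacement on the propensity score, which is only one possible matching rule and is not necessarily the one used in the paper; the function name `matching_effects` and the toy data are hypothetical.

```python
import numpy as np

def matching_effects(y, treated, pscore):
    """ATE, ATT and ATU from one-to-one nearest-neighbour matching
    (with replacement) on the propensity score. Illustrative only."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    pscore = np.asarray(pscore, dtype=float)
    idx_t = np.flatnonzero(treated)
    idx_c = np.flatnonzero(~treated)
    # |p_t - p_c| for every treated (rows) / untreated (columns) pair
    dist = np.abs(pscore[idx_t][:, None] - pscore[idx_c][None, :])
    y1 = y.copy()  # outcome with retired partner (observed or imputed)
    y0 = y.copy()  # outcome with non-retired partner (observed or imputed)
    y0[idx_t] = y[idx_c[dist.argmin(axis=1)]]  # counterfactual for treated
    y1[idx_c] = y[idx_t[dist.argmin(axis=0)]]  # counterfactual for untreated
    diff = y1 - y0
    return {"ATE": diff.mean(),
            "ATT": diff[treated].mean(),
            "ATU": diff[~treated].mean()}

# Toy data (hypothetical): housework minutes, retirement dummy, propensity score
y = [300.0, 280.0, 260.0, 320.0, 300.0, 280.0]
r = [1, 1, 1, 0, 0, 0]
p = [0.2, 0.4, 0.6, 0.2, 0.4, 0.6]
effects = matching_effects(y, r, p)
```

In the toy data every treated woman has an untreated counterpart with the same propensity score and 20 more minutes of housework, so the three parameters coincide; with real data they generally differ.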

The estimated propensities of the male partners to retire, obtained from the Biprobit
estimation, are used to perform the matching procedure; that is, we match women of
the two groups (partner retired or not). Modelling both the retirement equation and the
perceived-fairness equation as a Biprobit model, we assume that the man’s decision
to retire, $R_i$, and the woman’s perceived fairness, $S_i$, are specified as follows:


$$R_i^* = z'_{ri} \beta_r + u_{ri} \qquad (2)$$
with $R_i = 1$ (partner retired) if $R_i^* > 0$, and $R_i = 0$ (partner not retired) otherwise;

$$S_i^* = z'_{si} \beta_s + u_{si} \qquad (3)$$
with $S_i = 1$ (woman satisfied with the housework division) if $S_i^* > 0$, and $S_i = 0$
(woman dissatisfied) otherwise. Here $z'_{ri}$ and $z'_{si}$ are row vectors of the matrices $Z_r$
and $Z_s$ of the observable variables conditioning, respectively, the man’s propensity
to retire and the woman’s perceived fairness; $\beta_r$ and $\beta_s$ are vectors of coefficients.
We assume that the error terms $u_{ri}$ and $u_{si}$ are normally distributed, $N(0, \Sigma)$. The
correlation between the error terms can be considered as a measure of the latent
influence of bargaining on the man’s decision to retire. Equations (2) and (3)
are estimated by Maximum Likelihood (ML) (as in a SUR
model), taking into account the endogenous influence of bargaining measured by the
correlation of the errors. We use the predicted propensities to retire, provided by the Biprobit
regression, to apply the Simple Matching Estimator (SME) given in Eq. (1).
In addition, in order to avoid the selectivity bias due to imbalance in the
observed covariates $Z_r$ and $Z_s$, which condition the matching, we also apply a
“Bias-Corrected” matching estimator (BCME) and a “Stratification-Matching” (SM)
estimator [8,1,9]. For this study we use cross-sectional microdata selected from the
2008-2009 ISTAT Survey on Time Use in Italy, in which the use of time is surveyed
with the diary method. The selected sample is composed of 3,126 elderly women
living with their male partners (aged 50-66), evenly distributed by area of
residence. The male partners of the selected couples are employed (2,096) or
retired (1,030).


Estimation Results 


The estimation results of the Biprobit model (Table 1) show that the
man’s retirement, more frequent in the North-Centre of Italy, is negatively related to
his education level. In addition, the man’s retirement is positively related to the
retirement of the woman.

Table 1: Biprobit estimation results

Dependent variables: S_i = woman's satisfaction with housework division (dummy: 1 = satisfied);
R_i = man's retirement decision (dummy: 1 = retired)

Explanatory variables:    coef. (S_i)    coef. (R_i)
Intercept    2.96*    -14.05***
Education of woman (years of schooling)    0.01    0.02
Education of man (years of schooling)    0.01    -0.06***
Religiosity: 1 if the woman attends church (a)    0.11*
Children living in the family: 1 = yes (a)    0.05    0.11
Worried: 1 if the woman feels in trouble for his work (a)    -0.27***    -0.15*
Woman's economic dependency (a)    0.07    0.27*
Age of woman    -0.12*    0.09
Age^2 of woman    0.001*    0.0001
Age of man    0.01    0.20***
Area of residence: 0 = North-Centre; 1 = Southern regions (a)    -0.10*    -0.32***
Help received in paid form: 1 = yes (a)    -0.08    -0.43**
Health: 1 = sick (a)    0.17*
Woman retired: 1 = yes (a)    0.48***
Retirement eligibility of man (Eligibility = 1 if he is at least 58 years old) (a)    0.49***
Eligibility*(Age-58)    0.09
Eligibility*(Age-58)^2    -0.08
Eligibility*(Age-58)^3    0.01*

Note: Estimated correlation between error terms, rho = 0.19 (LR test on rho = 0: Chi^2 = 26.29; p <
0.0001). P-value: * p<0.05; ** p<0.01; *** p<0.001; (a) Dummy variable


Note that the estimated correlation between the error terms of the Retirement
and Satisfaction equations (rho = 0.189) is positive and significant. This confirms
that an endogenous relationship between the man’s retirement and bargaining occurs.
Our results indicate a partial reallocation of intra-household housework time in
favour of the woman. In fact, the estimated treatment parameters show a
reduction of the woman’s commitment to domestic work and a simultaneous
increase of the male partner’s commitment (cf. Table 2). In more detail, the
estimation results on the full sample show a modest reduction of the woman’s housework
time as an effect of the retirement of the male partner. Across
the three estimators (SME, BCME and SM), the
estimated ATE is a reduction in the woman’s housework of about 17-25
minutes per day, and the estimated ATT is likewise a reduction of about 17-25 minutes.
At the same time, the man’s housework increases by about 90 minutes per day
according to the SME estimates of both ATE and ATT (without bias correction).
However, markedly lower values are obtained with the BCME and SM
estimators (approximately 60-70 minutes). The estimates reported in Table 3 show
that women with higher bargaining power (satisfied with labour division) generally
obtain a larger reduction of housework time than women with lower bargaining
power (dissatisfied with labour division). The reduction of housework time estimated
by SM indicates a difference of about 20 minutes per day in favour of the women
with higher bargaining power.


Table 2 - Estimated causal effects of man’s retirement on housework time (minutes per day).

                          Estimator      ATT      SE     ATU      SE     ATE      SE      HI
Domestic work of woman    SME         -19.94    8.72  -24.89    8.38  -22.52    8.37    0.41
                          BCME        -17.38   14.36  -15.83   15.33  -16.66   13.18   -0.07
                          SM          -25.03   22.20  -25.44   22.70  -25.16   20.37    0.01
Domestic work of man      SME          91.55    7.31   86.22    6.92   88.77    6.95    0.53
                          BCME         58.38   12.60   85.94   13.44   71.13   11.59   -1.50
                          SM           72.33   17.09   63.40   16.95   67.88   15.13    0.37

Note: HI = Heterogeneity Indicator: HI = (ATT - ATU) / SE(ATT - ATU)


Table 3 - Estimated causal effects on housework time (minutes per day) by satisfaction of the
woman with labour division within the family

                                      Woman satisfied                                   Woman dissatisfied
                          Estimator      ATT      SE     ATU      SE     ATE      SE      ATT      SE     ATU      SE     ATE      SE
Domestic work of woman    SME         -18.17   13.37  -23.23   15.06  -20.77   11.75   -14.39   14.31  -23.55   12.28  -19.24   13.08
                          BCME        -30.98   15.61  -24.56   14.99  -27.68   13.65    -7.9    17.64  -22.85   18.25  -15.82   16.08
                          SM          -37.05   28.98  -36.09   26.93  -36.61   27.58    -6.95   35.54  -10.67   34.79   -8.36   32.03
Domestic work of man      SME          96.5    10.58  101.55    9.03   99.1     8.73    74.39   12.64   65.41   12.61   69.63   10.13
                          BCME         63.73   12.53   64.91   12.51   64.34   11.18    48.71   17.07   52.58   15.49   50.76   14.53
                          SM           64.14   24.36   65.73   23.98   64.95   20.94    55.64   27.3    59.73   29.61   57.41   26.53


1.1 Sensitivity Analysis 


We follow a parametric approach to sensitivity analysis in order to test the effect of a
confounder variable on the estimated propensity scores and on the treatment
parameters, such as the ATT [4,5]. In particular, we replicate the estimates of both the
propensity scores and the ATT, including a simulated endogenous confounder in the
regressor set of the Biprobit regression with the purpose of violating the
Conditional Independence Assumption (CIA). To do this, a confounder covariate,
$U_i$, is simulated using the predicted values of a linear regression of the outcome
variable $y_i$ (housework time) on the covariates that condition the propensity score.
We draw random values from the confidence intervals of the regression coefficients
in order to generate the confounder and replicate the estimation procedure (1,000
replications). Table 4 reports the results of the sensitivity analysis.


Table 4 - Sensitivity analysis of SME estimation using a confounder variable

                                                 Confounder in       Confounder in       Confounder in
                                                 Retirement eq.      Satisfaction eq.    both eqs.
                                                 mean       SE       mean       SE       mean       SE
Wilcoxon Signed-Ranks Test on matched
  pairs of propensity score values               0.189    0.016    -0.189    0.228     0.192    0.018
ATT for woman’s domestic work                  -24.696    0.118   -20.456    0.045   -24.662    0.120
Student-t test on paired causal effects         -0.233    0.013     0.081    0.013    -0.225    0.013

Note: values for the ATT estimate using SME: Mean = -19.935; SE = 9.135; 95% Conf.
Interval (-37.875; -1.995)


Sensitivity analysis confirms the robustness of the matching procedure based on the
Biprobit model. The average ATT computed over 1,000 replications of the SME procedure
with the confounder does not differ substantially from the value estimated
without the confounder. In particular, the ATT does not change
significantly as a consequence of the endogeneity of the woman's perceived fairness
(endogeneity of bargaining), simulated by introducing the confounder in the
Satisfaction equation.

Composite indicators for ordinal data: the
impact of uncertainty




Stefania Capecchi and Rosaria Simone 


Abstract Composite indicators are becoming one of the most prominent analysis
tools, especially in the social sciences, where the need arises to compare and rank groups
of respondents while managing huge and diversified amounts of data. The aggregation
of information is a powerful yet incomplete operation, since it usually fails to
account for uncertainty. Uncertainty is here meant as the inherent indeterminacy
of any decision process, specifically with reference to the discrete-choice process
leading interviewees to provide an ordinal evaluation of their latent perception.
The class of CUB mixture models for ordinal data is grounded on the probabilistic
specification of this component, thus establishing a direct control for heterogeneity.
Empirical evidence and methodological studies establish this framework as an effective
statistical modelling approach among well-known consolidated theories. In this setting, our
contribution proposes a technique to build model-based composite indicators that
disclose the role of uncertainty also at an aggregated level. The presentation is led
by applications to real data and comparisons with existing methods.


Abstract Composite indicators are often counted among the most relevant analysis
tools, especially in the social sciences, where the need arises to compare and
rank groups of respondents while managing large amounts of data. Aggregating
information is a complex operation and may turn out to be
incomplete if it disregards uncertainty. Here uncertainty means
the intrinsic indeterminacy of any decision process. With reference
to the discrete-choice process that leads interviewees to express an ordinal
evaluation of their latent perception, the class of CUB models for ordinal data is
based on the probabilistic specification of this component, thus also allowing
direct control of heterogeneity. Empirical evidence and methodological studies
make this modelling framework an effective alternative among well-established theories. In
this context, our contribution proposes a technique for building model-based
composite indicators that reveal the role of uncertainty also at an aggregate
level. The presentation is guided by applications to real data and comparisons with
existing methods.


Keywords: Uncertainty; Model-based Composite Indicators; Ordinal Data; 
CUB Models 


1 Introduction 


Composite indicators are among the most prominent tools in the social sciences for
synthesizing several pieces of information on a specific topic [3], as in well-being
measurement, for instance [7]. Indeed, in the Big Data era, composite
indicators make it possible to summarize complex and multi-dimensional issues efficiently
and to reduce the size of the available list of indicators [12].

Generally, composite indicators concern official data [8] and, in this respect,
several proposals have been collected [9] and promoted [6]; uncertainty and sensitivity
analysis are suggested to ensure the robustness and reliability of the results [12].
The main challenge is then to discard as little information as possible in the synthesis.

In this area, our contribution concerns indicators computed on ordinal
data arising from surveys where people are asked to express their perception with a
rating over a set of discrete choices [13]. Such data are frequent in several scientific
fields, for instance:


• University evaluation: scores are collected for investigating characteristics of
  both teaching and structures.
• Elderly well-being: ratings concern medical, physical and mental abilities.
• Customer satisfaction: many aspects of the relationship between clients and
  company are examined to investigate loyalty, for instance.


In the present work we assume that respondents express their ratings according to
CUB mixture models [10, 2, 5]. This modelling approach prescribes that responses
stem from the combination of two main components driving the decision process,
named feeling and uncertainty, and it has been successfully applied to analyze
ratings on opinions, judgments and preferences in several disciplines. Here we
propose a strategy to maintain these two main components also at an aggregated level
by presenting a model-based composite indicator. Due to space constraints, we
defer any unspecified detail to the references and limit ourselves to recalling that the
characterizing uncertainty parameter ($\pi$) may be interpreted as a distance from a
completely random choice, and that each CUB model may be uniquely represented as a
point in the unit square.

The paper is organized as follows: in the next section the new proposal is
introduced, and in Section 3 some empirical evidence is discussed. Final considerations
about further generalizations and developments are summarized in the concluding
remarks.

2 A new proposal


Consider a questionnaire designed to measure a latent trait (such as teacher’s
performance, customer satisfaction, etc.) via $K$ observable variables $R_1, \dots, R_K$
collected on an ordinal scale (say, with $m$ categories). Let $\|r_{ik}\|$, for $i = 1, 2, \dots, n$ and
$k = 1, 2, \dots, K$, be the matrix of the responses given to the $K$ items by the $n$
respondents. Then, $r_i = (r_{i,1}, \dots, r_{i,K})$ is the row vector of observations on the $i$-th subject,
$i = 1, \dots, n$.

According to a model-based approach, we assume that a CUB model fits the data
in an effective parametric way; thus, $R_k \sim CUB(\pi_k, \xi_k)$, for $k = 1, \dots, K$, where:

$$Pr(R_k = r \mid \pi_k, \xi_k) = \pi_k \binom{m-1}{r-1} \xi_k^{m-r} (1 - \xi_k)^{r-1} + (1 - \pi_k)\,\frac{1}{m}, \qquad r = 1, \dots, m.$$

Then, we propose a weighted CUB model $R \sim CUB(\tilde{\pi}, \tilde{\xi})$:

$$\tilde{\pi} = \sum_{k=1}^{K} w_k \pi_k, \qquad \tilde{\xi} = \sum_{k=1}^{K} w_k \xi_k \qquad (1)$$

as a 2-dimensional composite indicator for the latent trait (Composite Indicator CUB
model, CI-CUB for short). This choice makes it possible to take both uncertainty and feeling
into account by assigning higher weights to the most relevant items (as identified, for
instance, by PCA).
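
The construction in (1) can be sketched in a few lines of Python. The item-level estimates and the weights below are hypothetical, and the helper names `cub_pmf` and `ci_cub` are introduced here only for illustration; weights are rescaled to sum to one, which is an assumption of this sketch.

```python
from math import comb

def cub_pmf(m, pi, xi):
    """Probabilities Pr(R = r), r = 1..m, of a CUB(pi, xi) model."""
    return [pi * comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
            + (1 - pi) / m for r in range(1, m + 1)]

def ci_cub(pis, xis, weights):
    """Weighted averages of item-level CUB parameters, as in Eq. (1)."""
    total = float(sum(weights))
    w = [wk / total for wk in weights]  # rescale so that weights sum to 1
    pi_tilde = sum(wk * p for wk, p in zip(w, pis))
    xi_tilde = sum(wk * x for wk, x in zip(w, xis))
    return pi_tilde, xi_tilde

# Hypothetical estimates for K = 3 items rated on m = 7 categories
pi_hat = [0.9, 0.6, 0.3]
xi_hat = [0.2, 0.4, 0.5]
w = [0.5, 0.3, 0.2]
pi_tilde, xi_tilde = ci_cub(pi_hat, xi_hat, w)
pmf = cub_pmf(7, pi_tilde, xi_tilde)  # distribution implied by the CI-CUB
```

Because the composite is itself a pair of CUB parameters, it can be plotted in the unit square together with the item-level models.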

Customarily, composite indicators are computed on an individual basis by
working on the rows of the data matrix. Classical proposals include some average operation
(arithmetic, geometric, harmonic) of the individual ratings, or the selection of the
first component of a principal component analysis (PCA) performed on the data
matrix, say $Y_1$:

$$Y_1 = a_1 R_1 + \dots + a_K R_K,$$

with weights $a_1, \dots, a_K$ such that $\sum_{k=1}^{K} a_k = 1$, and set $w_k = a_k$.

Then, in order to get an overall assessment of the latent trait under investigation,
their distributions should be suitably taken into account. For instance, a new
variable ranging from 1 to $m$ can be obtained for each of them with a (uniform)
discretization over $m$ categories. In this way, comparisons with the CI-CUB proposal
can be made by fitting a CUB model to each of the resulting variables, and the
corresponding estimated parameter vectors $(\hat{\pi}, \hat{\xi})$ can be considered as model-based
composite indicators themselves. For instance, if we consider the discretized version of the
arithmetic mean, $R_A$, we assume that $R_A \sim CUB(\pi_A, \xi_A)$ is an adequate model for
the arithmetic mean composite indicator. The same holds for the first PCA component and
the other average operators.

3 Empirical evidence


We test the proposed approach (1) on the data set relgoods, which refers to a
survey on well-being and relational goods (available in the R package CUB). As
for any measure of subjective perception, data related to personal awareness always
come with some notable caveats. Many studies have discussed the reliability of
self-reported measures, also with respect to frame-of-reference effects and
adaptation to life events: see [11], among others. Moreover, to track progress and human
well-being and/or the “good life”, it is necessary to build a set of reliable indicators that
make the information understandable to stakeholders [7].

Figure 1 displays a multiplot of the CUB models of the selected items, together with the
estimated CUB models derived from the arithmetic (AM), geometric (GM) and harmonic (HM)
averages and from the first component of the PCA (PCA1), as explained in Section 2. This
scatterplot is obtained by plotting the estimated uncertainty $1 - \hat{\pi}$ against the estimated
feeling $1 - \hat{\xi}$ for each model (weights for the CI-CUB have been derived here from the
first principal component). In addition, the mean CUB model (MeanCUB), obtained
by simply averaging the estimated $\hat{\pi}$'s and $\hat{\xi}$'s (constant weights),
is shown.

Fig. 1 MultiCUB for relational goods and related composite indicators

From this visual inspection, and differently from other proposals, the CI-CUB
proposal affords a more adequate aggregation of information, since it coherently
preserves both uncertainty and feeling. Indeed, it provides explicit evidence of
possible heterogeneity in the responses. Conversely, PCA1, AM, GM and HM give
a rather biased synthesis of the data, since the model-based versions of these
composite indicators are farther from the estimated data models and fail to capture the
uncertainty component.


4 Comparing groups and individuals 


Customarily, composite indicators are used to compare and rank different
groups with respect to the investigated phenomenon (countries, university
departments, teachers, clusters of respondents identified by covariates, and so on). To
pursue this task according to the CI-CUB proposal for ordinal responses, assume that
data are gathered into $H$ groups.

Then, for $h = 1, \dots, H$, consider the methodology described above to build a
CI-CUB $R^{(h)} \sim CUB(\tilde{\pi}^{(h)}, \tilde{\xi}^{(h)})$:

• Let $r_i^{(h)} = (r_{i,1}^{(h)}, \dots, r_{i,K}^{(h)})$ denote the vector of observations of the $i$-th subject
  (unit) within the $h$-th group, $i = 1, \dots, n_h$, $h = 1, \dots, H$.
• For every $h = 1, \dots, H$ and $k = 1, \dots, K$, fit a CUB model to the responses $R_k$ in the
  $h$-th group, resulting in $R_k^{(h)} \sim CUB(\pi_k^{(h)}, \xi_k^{(h)})$.
• Then, for a suitable system of weights, consider the CI-CUB model
  $R^{(h)} \sim CUB(\tilde{\pi}^{(h)}, \tilde{\xi}^{(h)})$, with:

$$\tilde{\pi}^{(h)} = \sum_{k=1}^{K} w_k \pi_k^{(h)}, \qquad \tilde{\xi}^{(h)} = \sum_{k=1}^{K} w_k \xi_k^{(h)}.$$

Since there is no unique ordering of two-dimensional vectors, if the (latent)
phenomenon under investigation is positive in the direction of the scale, we
suggest comparing and ranking the $H$ groups according to the composite indicator
$Pr(R^{(h)} \geq m^* \mid \tilde{\pi}^{(h)}, \tilde{\xi}^{(h)})$, where $m^* < m$ is a threshold category lower-bounding the
positive responses. The reverse direction applies if the (latent) phenomenon under
investigation is negative in the direction of the scale. If, instead, an individual
composite indicator is more suitable for the analysis, a model-based proposal stemming
from the setting developed here is to consider:

$$I_i = \sum_{k=1}^{K} w_k \, Pr(R_k = r_{i,k} \mid \hat{\pi}_k, \hat{\xi}_k), \qquad i = 1, \dots, n.$$
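
As a rough illustration of the two uses described in this section, the following sketch ranks groups by the probability of a rating at or above a threshold category and computes an individual indicator as a weighted sum of fitted probabilities. All parameter values, group labels and function names are hypothetical, and the threshold is treated as inclusive, which is an assumption of this sketch.

```python
from math import comb

def cub_pmf(m, pi, xi):
    """Probabilities Pr(R = r), r = 1..m, of a CUB(pi, xi) model."""
    return [pi * comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
            + (1 - pi) / m for r in range(1, m + 1)]

def group_score(m, pi_tilde, xi_tilde, m_star):
    """Pr(R >= m_star) under the CI-CUB model of a group."""
    return sum(cub_pmf(m, pi_tilde, xi_tilde)[m_star - 1:])

def individual_indicator(row, weights, params, m):
    """Weighted sum over items of the fitted probability of the observed rating."""
    return sum(w * cub_pmf(m, pi, xi)[r - 1]
               for r, w, (pi, xi) in zip(row, weights, params))

# Hypothetical CI-CUB parameters (pi, xi) for three groups, m = 7 categories
groups = {"A": (0.8, 0.3), "B": (0.5, 0.3), "C": (0.8, 0.6)}
ranking = sorted(groups,
                 key=lambda g: group_score(7, *groups[g], m_star=5),
                 reverse=True)

# Hypothetical respondent: ratings on K = 2 items, equal weights
score = individual_indicator([6, 4], [0.5, 0.5], [(0.9, 0.2), (0.6, 0.4)], 7)
```

In this toy setting groups A and B share the same feeling but B has more uncertainty, so B ranks below A; group C has lower feeling and ranks last.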
5 Conclusions


As confirmed in the selected case study, commonly acknowledged choices for building a
social indicator based on averages and PCA are highly data-dependent. If the
first principal component is not really explanatory (for instance, it captures
only about 27% of the variability for the selected items of the
case study), the resulting index cannot be taken as a measure of the latent trait under
examination. In addition, these standard approaches completely fail to account for the
uncertainty component and thus waste an important amount of information. Conversely,
the model-based approach that leads to the CI-CUB proposal gives satisfactory
performance and easily lends itself to more refined item-based analyses,
as when overdispersion [4] or a shelter effect [5] is suspected for certain items. This
extension and further developments on the selection of optimal weights are left for
future work.

The distribution of Net Promoter Score in
socio-economic surveys




Stefania Capecchi and Domenico Piccolo 


Abstract In marketing studies devoted to customer satisfaction and consumer
loyalty analysis, Reichheld (2003, 2006) proposed the Net Promoter Score (NPS) to
synthesize, by means of an excess index, the distribution of the sample responses to
a question such as “How likely is it that you would recommend our Company/Institution
to a friend or colleague?”, on an ordinal scale from 1 to 10. More specifically, this
measure is obtained as the difference between the proportion of “enthusiasts”
and that of “detractors”. This index may be fruitfully exploited in different research
fields, wherever the whole set of information is meant to be summarized
by comparing the relative frequencies of “supporters” and “detractors” with respect
to products, services, items, etc. Although the literature points out both critical and
positive aspects of such an index, only recently has Rocks (2016) addressed inferential
procedures for NPS. In this study, we search for the distribution of NPS
based on a convenient structure of the response patterns. Indeed, we assume a
parametric mixture for the responses and examine the behaviour of NPS over the parameter
space.

Abstract In the field of marketing studies, Reichheld (2003, 2006) introduced
the Net Promoter Score (NPS), a sort of excess measure for the distribution
of the responses to the question, on an ordinal scale from 1 to 10: “How likely would
you be to recommend our Company/Institution to a friend or colleague?”. The index is the
difference between the proportion of enthusiastic respondents and that of detractors. This
measure can also be profitably used in other settings where it is appropriate to compare
the relative frequency of those “very favourable” to services, products, items, etc. with that
of the “critics”. The literature discusses positive and critical aspects of this proposal, and
recently Rocks (2016) has addressed the issue from an inferential point of view.
In this work, we examine the distribution of the NPS index under the hypothesis
that the responses are generated by a mixture suitable for surveys with
ordinal responses concerning judgements and/or opinions.


Keywords: Ordinal data, Net Promoter Score, Mixture models


1 Introduction 


Customer satisfaction is one of the most important concerns of companies, since
this variable summarizes the reactions and sentiments of clients. More specifically,
loyalty is regarded as a fundamental component for maintaining success. Proposals have
been introduced to help companies monitor and predict this key driver.

Among the several syntheses aimed at interpreting the mood of clients, one
question has emerged as a signal of confidence and loyalty towards the Company: “How
likely is it that you would recommend our Company/Institution to a friend or
colleague?”, with responses on an ordinal scale from 1 to 10. Thus, Reichheld [6, 7]
introduced the Net Promoter Score (NPS) as the proportion of extremely favourable
respondents minus the proportion of disaffected ones. Note that NPS is a
trademark of Satmetrix Systems, Inc., Bain & Company, Inc. and Fred Reichheld.

Briefly, NPS is considered a customer loyalty metric able to measure bond,
endorsement and sponsor support between a provider and a consumer. Despite some
critical comments, this measure is now regularly applied by thousands of
companies. In general, it has become a benchmark for company policy, and its
use may easily be extended to products, services, holidays, financial
advisors, banks, diets, health protocols, educational training, and so on.

From a statistical point of view, NPS is an estimate of the mean value of a discrete
random variable whose probabilities are generated by a distribution expressing the
graded opinions of a sample of respondents on an ordinal scale ranging from 1
to $m$, for a given $m$.

In this paper we show that a large collection of models generates the same NPS.
In addition, the (underestimated) uncertainty always present in human decisions, as
well as the heterogeneity of the respondents, may largely affect the NPS value. The
framework of the analysis is a model-based approach to the data generating process
by which respondents express their judgements about the selected question.


2 Notation and formal background 


Let $R$ be a discrete random variable defined, for a given $m$, on the support $\{1, 2, \dots, m\}$
and able to describe the mechanism of the ordinal responses. Assume that $R$ is fully
characterized by the probability distribution $p_r = p_r(\theta) = Pr(R = r \mid \theta)$, for
$r = 1, 2, \dots, m$, where $\theta \in \Omega(\theta)$. Then, the NPS is defined by:

$$NPS = \sum_{r=b}^{m} p_r - \sum_{r=1}^{a} p_r, \qquad -1 \leq NPS \leq +1,$$

where $1 \leq a < b \leq m$, and $a$ and $b$ are given integers.

People with scores $R \in [1, a]$, $R \in [a+1, b-1]$ and $R \in [b, m]$ are denoted
as “detractors”, “passives” and “promoters”, respectively. Thus, we let
$p_{det} = p_1 + \dots + p_a$, $p_{pas} = p_{a+1} + \dots + p_{b-1}$ and $p_{pro} = p_b + \dots + p_m$. In common analyses,
$m = 10$, $a = 6$, $b = 9$; sometimes, the Likert scale starts at $r = 0$.

It is immediate to show that NPS coincides with the expectation of a discrete
random variable $X$ defined on the support $\{-1, 0, 1\}$ with probabilities $\{p_{det}, p_{pas}, p_{pro}\}$,
respectively. Thus, all the characteristics of this index are specified in the ternary
simplex. In particular, according to Huber [2],

$$E(X) = NPS = p_{pro} - p_{det}; \qquad Var(X) = p_{pro} + p_{det} - [p_{pro} - p_{det}]^2.$$

As a mean value, NPS may be effectively estimated by

$$\widehat{NPS} = \sum_{r=b}^{m} f_r - \sum_{r=1}^{a} f_r, \qquad -1 \leq \widehat{NPS} \leq +1,$$

where $f_r = n_r/n$ and $n_r$, for $r = 1, 2, \dots, m$, are the relative and absolute
frequencies derived from the sampling distribution $(n_1, n_2, \dots, n_m)$ of scores, with
$n = n_1 + \dots + n_m$. Of course, such frequencies are realizations of a Multinomial
distribution characterized by $n$ and $p_r = p_r(\theta)$, for $r = 1, 2, \dots, m$.

These results imply that $\widehat{NPS}$ is an unbiased and consistent estimator of NPS
with variance $n^{-1} Var(X)$. Moreover, the standardized $\widehat{NPS}$ is asymptotically
Normally distributed; thus, tests and confidence intervals may be derived. A survey of
inference for the NPS estimator, with some improvements, is discussed in [8].
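
The estimator and its asymptotic interval can be illustrated with a short Python sketch. The counts below are hypothetical, the function name `nps_estimate` is introduced only for this example, and the variance is the plug-in version of $n^{-1} Var(X)$ with sample frequencies in place of probabilities; the interval is the plain Normal approximation, not one of the refinements surveyed in [8].

```python
from math import sqrt

def nps_estimate(counts, a=6, b=9, z=1.96):
    """Sample NPS, its standard error and a Normal-approximation interval.

    counts[r - 1] is the number of respondents choosing category r = 1..m.
    """
    n = sum(counts)
    f_det = sum(counts[:a]) / n        # relative frequency of categories 1..a
    f_pro = sum(counts[b - 1:]) / n    # relative frequency of categories b..m
    nps = f_pro - f_det
    var_x = f_pro + f_det - nps ** 2   # plug-in estimate of Var(X)
    se = sqrt(var_x / n)
    return nps, se, (nps - z * se, nps + z * se)

# Hypothetical sample of n = 200 responses on a 1-10 scale
counts = [5, 5, 10, 10, 10, 10, 20, 30, 60, 40]
nps_hat, se_hat, (ci_low, ci_high) = nps_estimate(counts)
```

Here a quarter of the sample are detractors and half are promoters, so the point estimate is the difference of the two proportions and the interval width shrinks at rate $n^{-1/2}$.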


3 A model-based approach 


On the basis of experimental evidence and statistical reasoning, we assume that the
ordinal responses expressing customer judgements/opinions are generated by a CUB model
as in [5, 1]. More specifically, $R \sim CUB(\pi, \xi)$ is a random variable defined over the
support $\{1, 2, \dots, m\}$, for a given $m$, whose probability mass function is:

$$Pr(R = r \mid \theta) = \pi \binom{m-1}{r-1} \xi^{m-r} (1 - \xi)^{r-1} + (1 - \pi)\,\frac{1}{m}, \qquad r = 1, 2, \dots, m.$$

The model is well defined over the parameter space
$\Omega(\theta) = \Omega(\pi, \xi) = \{(\pi, \xi) : 0 < \pi \leq 1;\ 0 \leq \xi \leq 1\}$ and it is identifiable for any $m > 3$, whereas $m = 3$ represents
a saturated model [3].

Then, $(1 - \pi)$ increases with the indecision/heterogeneity of the responses, whereas
$(1 - \xi)$ increases with the confidence/loyalty of the client. Subjects’ covariates are
generally linked to the parameters by a logistic function for simplicity; however, other
mappings are legitimate.

The advantage of this parameterization is the ability to capture different patterns
of the observed distributions (in terms of modal values, skewness and flatness) by
means of only two parameters, $\theta = (\pi, \xi)'$, which are easily interpreted with respect
to the components of the random process [4]. In addition, any CUB model admits a
visual representation as a point in the parameter space $\Omega(\theta)$. Then, the introduction
of subjects’ covariates makes it possible to investigate whether and how the individual
characteristics of the respondents affect the expressed opinions.

Fig. 1 NPS contour plots over the parameter space of a CUB model (left panel). Subregion where
NPS is positive (right panel)


In this line of reasoning, we investigate the behaviour of NPS when responses follow a CUB
distribution. The left panel of Figure 1 shows the contour lines for given
NPS values over $\Omega(\theta)$ and distinguishes negative, null and positive values of this
measure. The right panel magnifies the top-left area of the parameter space where
NPS > 0: this is the area of interest for companies seeking growth. As
expected, a rewarding NPS is the consequence of both a moderate uncertainty and
a substantially positive endorsement; however, infinitely many CUB models yield
the same NPS.
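
The fact that many CUB models share one NPS value can be checked numerically. The sketch below computes the NPS implied by a CUB model and, for several hypothetical values of $\pi$, searches by bisection for the $\xi$ that gives a chosen NPS. It relies on the NPS being decreasing in $\xi$ for fixed $\pi > 0$ (a larger $\xi$ shifts the feeling component towards low ratings); the target value and the function names are assumptions made only for this illustration.

```python
from math import comb

def cub_pmf(m, pi, xi):
    """Probabilities Pr(R = r), r = 1..m, of a CUB(pi, xi) model."""
    return [pi * comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
            + (1 - pi) / m for r in range(1, m + 1)]

def nps_cub(pi, xi, m=10, a=6, b=9):
    """NPS implied by a CUB(pi, xi) model: Pr(R >= b) - Pr(R <= a)."""
    p = cub_pmf(m, pi, xi)
    return sum(p[b - 1:]) - sum(p[:a])

def xi_for_target(target, pi, m=10, a=6, b=9, tol=1e-12):
    """Bisection on xi for fixed pi; None if the target NPS is unreachable."""
    lo, hi = 0.0, 1.0
    if not nps_cub(pi, hi, m, a, b) <= target <= nps_cub(pi, lo, m, a, b):
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if nps_cub(pi, mid, m, a, b) > target:
            lo = mid   # NPS still too high: move towards larger xi
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Different (pi, xi) pairs lying on the same contour line NPS = 0.25
same_nps = {pi: xi_for_target(0.25, pi) for pi in (0.6, 0.8, 1.0)}
```

Each entry of `same_nps` is a point on one contour line of the kind shown in Figure 1: the lower $\pi$ is, the smaller $\xi$ (the stronger the feeling) must be to reach the same NPS.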
4 The role of uncertainty


In the CUB models framework it is important to emphasize that $(1 - \pi)$ is the weight of
the Uniform distribution and thus it measures both the personal indecision of
respondents and the presence of different reactions in the sample (heterogeneity). This
twofold meaning manifests itself in the same parameter: in fact, if people converge
on a category, this means a low level of indecision among respondents and it
generates low heterogeneity within the sample. On the contrary, when respondents are
fuzzy and quirky they generate high heterogeneity. This component (which may be
effectively estimated from sample data) also affects the interpretation of NPS.

As a matter of fact, Figure 1 emphasizes the role of uncertainty in the assessment
of the NPS index. For instance, a feeling as high as $1 - \xi = 0.85$ and a very low
indecision/heterogeneity of respondents, expressed by $1 - \pi = 0.10$, generate
NPS = 0.50, whereas an increase of uncertainty up to $1 - \pi = 0.30$, say, lowers NPS
to 0.25. This implies that a very high modal value in the response distribution is a
necessary but not a sufficient condition for an appreciably positive NPS.

Fig. 2 Effect of uncertainty in the specification of NPS


Further insight may be derived from the consideration that the feeling of
respondents is modified by uncertainty/heterogeneity. In a sense, we would like to obtain
the NPS for the feeling alone ($NPS_{feel}$, say), but we estimate NPS on the basis of the
expressed responses (which are a mixture of both components). Simple algebra
proves that:

$$NPS = \pi\, NPS_{feel} + (1 - \pi)\, const,$$

where $const = (m + 1 - a - b)/m$ and, in common cases, $const = -0.4$. Except for
$\pi = 1$, that is, a model without uncertainty, the presence of indecision/heterogeneity
in the data reduces NPS.

Figure 2 shows how uncertainty modifies the correspondence between the desired
$NPS_{feel}$ and the observed NPS by systematically lowering the latter, except on the
bisector (where $\pi = 1$).

This effect propagates in a different way when a CUB model with covariates is
considered. In the simplest case of a dichotomous variable only affecting the feeling
(for instance, $D_i = 0, 1$ for men and women, respectively), the differential effect
on NPS between women and men, with obvious notation, is:

$$NPS^{(1)} - NPS^{(0)} = \pi \left[ NPS_{feel}^{(1)} - NPS_{feel}^{(0)} \right].$$

As a consequence, a possible discrimination between the two clusters would appear
attenuated by a factor of $\pi$ due to the presence of uncertainty.


5 Concluding remarks 


Widespread experience, derived from more than a thousand observed NPS values [8],
shows that this measure mostly takes values between 0 and 0.50, with a variance in
[0.50, 0.75]. As a consequence, effective studies should concentrate on the NPS
distributions that put a large mass on this sub-region of the parameter space.
News, Volatility and Price Jumps


Massimiliano Caporin and Francesco Poli 


Abstract From two professional news providers we retrieve news stories and earnings
announcements of the S&P 100 constituents and 10 macro fundamentals; moreover, we
gather Google Trends of the assets. We create an extensive and innovative database,
useful to analyze the link between news and asset price dynamics. We detect the
sentiment of news stories using a dictionary of sentiment words and negations, and
propose a set of more than 5K information-based variables that provide natural proxies
of the information used by heterogeneous market players and of retail investors'
attention. We first shed light on the impact of information measures on daily realized
volatility and select them by penalized regression; then, we use them to forecast
volatility and obtain superior results with respect to models that omit them. Finally,
we relate news to intraday jumps using penalized logistic regression.

Abstract We obtain from two professional news providers the news and the earnings
announcements of the S&P 100 constituents and 10 macroeconomic indicators; in addition,
we collect the Google Trends associated with the stocks. We create an extensive and
innovative database, useful for analyzing the link between news and stock price
dynamics. We detect the sentiment of the news using a dictionary of sentiment-bearing
words and of negations, and we propose a set of more than 5K variables that represent
the information used by heterogeneous agents and the attention of small investors. We
shed light on the impact of the information measures on daily realized volatility and
select them with penalized regression; we then use them to forecast volatility,
obtaining results superior to those of models that omit them. Finally, we relate the
news to intraday jumps using penalized logistic regression.
Key words: news, Google Trends, sentiment, volatility, forecasting, jumps,
regularization, big data


1 Introduction 


According to the mixture of distributions hypothesis (MDH): “A serially correlated
mixing variable measuring the rate at which information arrives to the market explains
the GARCH effects in asset returns.” We want to verify its validity and, more
generally, to shed light on the link between news and volatility. In addition, we want
to understand which news indicators are likely to provoke price jumps.

We create a database which contains information useful to address these questions.
From two news providers we retrieve news stories and earnings per share (EPS)
announcements of the S&P 100 constituents, and 10 macroeconomic announcements. We also
collect Google Trends of the assets, and use them as a proxy for retail investors'
attention. We detect the sentiment of news stories using the sentiment-related word
lists developed by [6] and introduce a set of negations, with the aim of extracting
the sentiment of a financial text independently of its type, length and audience. We
propose a set of news measures that provide natural proxies for the information used
by heterogeneous market players. We end up with a large set of news measures, each
representing a different type of information potentially causing a different market
reaction. We test the MDH and shed light on the impact of news on volatility using the
information-related variables we develop. We use the database to explain realized
volatility, selecting the most important indicators with LASSO, and then improve
volatility forecasting in an out-of-sample analysis. Finally, we relate news to
intraday jumps with the Elastic Net.


2 Database Construction 


We collect news and indicators from two news providers, FactSet-StreetAccount and
Thomson Reuters, and from Google Trends. We use the latter as a proxy for retail
investors' attention, while the providers gather information more relevant for
professional investors. The dataset covers the period from February 4th, 2005 to
February 25th, 2015, and all data have minute precision, except for Google Trends,
which are daily.

We get firm-specific news and Google Trends of the S&P 100 constituents, since they
are highly capitalized and attention-grabbing companies. We exclude 11 stocks from
the database since news about them was not available from both providers for the whole
sample. The information in the database can be classified into five types:


1. StreetAccount news stories (firm-specific). They are classified along 11 topics,
and we use 7 of them. Irrelevant news items are filtered out and the news is not
redundant, that is, each news item is released only once.

2. Thomson Reuters news stories (firm-specific). Each story is organized according to
a topic and a level of significance. There are 36 topics, of which we use 6, and
four levels of significance: low, medium, high, top. Irrelevant stories are also
filtered out and the stories are not redundant.

3. EPS announcements. They are released by StreetAccount and comprise both the
company’s reported actual quarterly EPS and the consensus forecast figure.

4. Macro announcements. 10 macroeconomic indicators released by Reuters: consumer
confidence, CPI, FOMC rate decisions, GDP, industrial production, balance of
payments, jobless claims, non-farm payrolls, PPI and retail sales. They also
comprise both the reported indicator and its consensus forecast.

5. Google Trends. Relative indicators of internet search volume available from 
Google. They summarize the searches performed through Google and represent 
how many web searches have been done for a keyword in a period of time in 
a given geographical area relative to the total in the same period and area. For 
each stock, we look at the global volume of search queries for the name of the 
company. 


3 Sentiment Detection 


We detect the sentiment of news stories, that is, an indicator of whether the content
of a document is good, bad or neutral in relation to the issue it addresses.

We use the sentiment-related word lists developed by [6], which are tailored for
financial texts. The authors of [6] account for negation, but only through six words
and only if one of them precedes a negative word, and apply the methodology to the
10-K filings of US companies. We deal, instead, with news created by news providers,
which are less limited in the use of language. We introduce the following
improvements: we invert the sentiment each time a word, irrespective of whether it is
positive or negative, is preceded by a negation, and we extend the set of negations by
employing 28 single words, 24 sequences of two words (e.g. “far from”) and 6
sequences of three words (e.g. “by no means”). We believe that this modification
makes it possible to extract the sentiment of a financial text with more confidence and
independently of its type, length and audience. The procedure we develop works as
follows:


1. positive words are given a value of 1, negative words -1, and the value is inverted
in case of negation

2. the values of all words with a sentiment are summed up to get the sentiment sum:
$Sent\_Sum = \sum_{i=1}^{N} s_i$, where $i$ is the word index, $N$ is the number of
words with a sentiment in a text and $s_i$ is the sentiment of the word indexed by $i$

3. Sent_Sum is divided by the number of words with a sentiment, obtaining the relative
sentiment Rel_Sent, which lies between -1 and 1: $Rel\_Sent = Sent\_Sum / N$

4. if Rel_Sent is bigger than 0.05 or smaller than -0.05 we associate, respectively, a
positive (1) or a negative (-1) sentiment with the news; otherwise the news is
neutral (0).
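
The four steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the word lists and negations are tiny hypothetical stand-ins for the lists of [6] and for the 58 negation expressions, and a negation is assumed to invert a sentiment word only when it immediately precedes it.

```python
import re

# Tiny illustrative lists (hypothetical; not the actual lists used in the paper).
POSITIVE = {"gain", "strong", "improve", "profit"}
NEGATIVE = {"loss", "weak", "decline", "lawsuit"}
NEGATIONS = {("not",), ("no",), ("never",), ("far", "from"), ("by", "no", "means")}


def word_sentiments(text):
    """Step 1: +1/-1 for each sentiment word, inverted when preceded by a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    values = []
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            value = 1
        elif token in NEGATIVE:
            value = -1
        else:
            continue
        # Check negations of one, two or three words ending right before the word.
        negated = any(tuple(tokens[max(0, i - n):i]) in NEGATIONS for n in (1, 2, 3))
        values.append(-value if negated else value)
    return values


def news_sentiment(text, threshold=0.05):
    """Steps 2-4: relative sentiment mapped to 1 (positive), -1 (negative) or 0."""
    values = word_sentiments(text)
    if not values:
        return 0  # no sentiment word at all: treated as neutral (assumption)
    rel_sent = sum(values) / len(values)  # Sent_Sum / N
    if rel_sent > threshold:
        return 1
    if rel_sent < -threshold:
        return -1
    return 0


print(news_sentiment("Strong profit and gain"))
print(news_sentiment("The results are far from strong"))
print(news_sentiment("Profit but also a loss"))
```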




4 Measures Creation 


We go beyond the standard techniques used to assign numbers to textual information:
we identify a set of concepts/events based on how news is released over different time
horizons, with the aim of reconstructing the different portions of information on which
the different market players base their decisions. In total, for each asset we end up
with 5,159 news-related variables for daily analysis and 878 news-related variables for
high-frequency analysis.


Concepts for News Stories Variables 


The variables are built following a scheme of several concepts, each of which is 
peculiar in the reaction it potentially causes in the market. All concepts refer to a 
reference period and to previous periods of equal or longer length. We list the main 
ones. 


1. standard measures: number of news, number of words, sentiment. The first two 
represent proxies for the quantity of information, sentiment was illustrated above. 

2. abnormal quantity: quantity of news above a threshold. Investors’ reaction 
could be triggered by the release of an unusual quantity of information. 

3. uncertainty: occurrence of news with opposite sentiment within the reference 
period. When this event happens, information is released but it is likely that in- 
vestors are unable to detect whether it is good or bad. 

4. quantity variation: variation across periods of the quantity of news, or words. 
This concept takes into account the chance that investors’ reactions are triggered 
not only by the release of information, but more generally by increases or de- 
creases in the quantity of information. 

5. news persistence/interaction: event in which the quantity of news is above a
threshold in each of two consecutive periods. Recalling that providers do not
supply redundant news, the occurrence of this event denotes persistence in the
release of news that is related to a different issue in each period.

6. sentiment inversion: event in which the sentiment of the reference period equals 
the opposite of the sentiment of previous periods. 
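
A few of the concepts listed above can be sketched as simple indicator variables. The sketch assumes that each period is represented by the list of sentiments (-1, 0, 1) of the news released in it, that the sentiment of a period is the sign of their sum, and that the reference period is compared with the single previous period; the threshold and the function names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def concept_indicators(news_by_period, threshold):
    """Build indicators for each period t >= 1, using period t - 1 as the previous one.

    news_by_period: list of periods, each a list of news sentiments in {-1, 0, 1}.
    threshold: number of news above which the quantity is considered abnormal.
    """
    rows = []
    for t in range(1, len(news_by_period)):
        current, previous = news_by_period[t], news_by_period[t - 1]
        s_now, s_prev = sign(sum(current)), sign(sum(previous))
        rows.append({
            "period": t,
            "abnormal_quantity": len(current) > threshold,
            "uncertainty": 1 in current and -1 in current,
            "quantity_variation": len(current) - len(previous),
            "persistence": len(current) > threshold and len(previous) > threshold,
            "sentiment_inversion": s_now != 0 and s_now == -s_prev,
        })
    return rows


# Four hypothetical periods with 3, 4, 2 and 0 news items.
periods = [[1, 1, 0], [-1, -1, -1, 0], [1, -1], []]
for row in concept_indicators(periods, threshold=2):
    print(row)
```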


Standardized Surprises of EPS and Macro Announcements 


EPS and macro surprises are constructed using techniques widespread in the literature.
With regard to EPS, from the actual figure and the consensus forecast we compute the
Standardized Unexpected Earnings (SUE) score, which measures by how many standard
deviations the reported actual EPS differs from the consensus forecast.

With regard to macro announcements, from the actual value and the consensus forecast of
the indicators we compute the standardized surprise as we do for earnings.
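
As an illustration, a standardized surprise can be computed as below. The source does not state which standard deviation is used for the scaling; this sketch assumes the sample standard deviation of the surprise series itself, and the figures are made up.

```python
from statistics import stdev


def standardized_surprises(actual, consensus):
    """Return (actual - consensus) divided by the sample std of the surprises.

    The choice of scaling is an assumption of this sketch, not taken from the paper.
    """
    surprises = [a - c for a, c in zip(actual, consensus)]
    scale = stdev(surprises)
    return [s / scale for s in surprises]


# Hypothetical quarterly EPS figures and consensus forecasts.
actual_eps = [1.10, 0.95, 1.30, 1.00]
consensus_eps = [1.00, 1.00, 1.20, 1.05]
print(standardized_surprises(actual_eps, consensus_eps))
```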
Google Search Index


Google restricts access to daily data for intervals longer than 10 months, but allows
daily data (relative to the maximum) to be gathered for shorter intervals. From the
set of daily series for each month and the monthly-aggregated series for the whole
sample we reconstruct the daily Google Search Index for the whole sample.


Proposed Measures Based on Different Time Horizons 


We propose a set of news measures suitable to be linked to daily asset price dynam- 
ics, by aggregating the information released during the following time horizons: 


1. daily: from market closing time of day $t-1$ to market closing time of day $t$;
2. overnight: from market closing time of day $t-1$ to market opening time of day $t$;
3. weekly: last 5 days;
4. monthly: last 22 days.


We then develop a set of news indicators which can be related to high-frequency 
asset price dynamics, by aggregating the information released during market open- 
ing times in three different lagged intervals: 


1. lag 0: last 10 min; 
2. lag 1: from -30 min to -10 min; 
3. lag 2: from -60 min to -30 min. 
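
The lagged aggregation can be sketched as follows, here for the number of news items. Whether the interval bounds are inclusive is an assumption of this sketch (each interval is taken as closed on the recent side and open on the older side), and the timestamps are invented.

```python
from datetime import datetime

# Lag -> (start, end) in minutes before the reference time, as listed above.
LAGS = {0: (0, 10), 1: (10, 30), 2: (30, 60)}


def count_news_by_lag(reference_time, news_times):
    """Count the news items falling in each lagged interval before reference_time."""
    counts = {lag: 0 for lag in LAGS}
    for timestamp in news_times:
        age = (reference_time - timestamp).total_seconds() / 60.0  # minutes ago
        for lag, (start, end) in LAGS.items():
            if start <= age < end:
                counts[lag] += 1
    return counts


reference = datetime(2015, 2, 25, 15, 0)
news = [
    datetime(2015, 2, 25, 14, 55),  # 5 minutes ago  -> lag 0
    datetime(2015, 2, 25, 14, 45),  # 15 minutes ago -> lag 1
    datetime(2015, 2, 25, 14, 20),  # 40 minutes ago -> lag 2
    datetime(2015, 2, 25, 13, 30),  # 90 minutes ago -> outside all intervals
]
print(count_news_by_lag(reference, news))
```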


5 Volatility Forecasting and Intraday Jumps 


We want to verify the validity of the MDH and to shed light on the link between 
news and volatility. We also want to understand which news indicators cause jumps. 


News Impact on RV 


We compute daily realized volatility from five-minute returns. Then, we decompose 
it into its continuous and jump components resorting to the jump test of [4], and 
we model daily realized volatility with the HAR-TCJ linear model of [4], based 
on the HAR-CJ model of [1] using their corrected threshold multipower variation 
measures. 


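HAR-TCJ model:


$$RV_t = \beta_0 + \beta_d C_d + \beta_w C_w + \beta_m C_m + \beta_j J_d + \varepsilon_t \qquad (1)$$


where $C$ and $J$ denote the continuous and jump components of realized volatility,
with $C_d = C_{t-1}$, $C_w = C_{t-5:t-1}$, $C_m = C_{t-22:t-1}$ and $J_d = J_{t-1}$.
Adding the news measures as regressors we obtain the HAR-TCJN model:

$$RV_t = \beta_0 + \beta_d C_d + \beta_w C_w + \beta_m C_m + \beta_j J_d + \beta_{News}' News_{t-1} + \varepsilon_t \qquad (2)$$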


where $\beta_{News}$ is the $k \times 1$ vector of coefficients and $News_{t-1}$ is the
$k \times 1$ vector of news measures built on the basis of the information available
before the market opening time of day $t$. We face a dimensionality problem in the
HAR-TCJN model since the number of regressors is higher than the number of
observations, and we resort to LASSO to select the most useful measures. LASSO [7] is
an estimation method for linear models that performs variable selection and coefficient
shrinkage, and was already used to model realized volatility by [3].
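
To illustrate how LASSO handles more regressors than observations, the sketch below runs a plain cyclic coordinate descent on simulated data. It is not the authors' implementation: the data, the penalty level lam and the function names are assumptions, and in practice the penalty would be tuned (e.g. by cross-validation or an information criterion).

```python
import numpy as np


def soft_threshold(z, lam):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes centred y and standardized columns of X (no intercept is fitted).
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual (excluding j)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


# Simulated example with p > n: only the first two regressors matter.
rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + 0.1 * rng.standard_normal(n)
y = y - y.mean()

beta = lasso_cd(X, y, lam=0.1)
selected = np.flatnonzero(beta)
print("selected regressors:", selected)
```

Most coefficients are set exactly to zero, which is what allows the selected regressors to be re-used afterwards in a lower-dimensional regression.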

We implement an in-sample analysis using the logarithmic counterparts ([4]) of the
models, and estimate the parameters of the HAR-TCJN model with LASSO. Ranking the
indicators by the number of assets for which their estimated $\beta$ is different from
zero, we see that macro announcements and EPS are the most important drivers of
volatility, but news stories and Google Trends also play a role. Macro announcements
count per se, as do surprises relative to expectations. Markets tend to react more
strongly to negative surprises, and on the basis of the information released during
several previous time horizons, from overnight to the last month. EPS announcements
per se and EPS surprises are both important as well, and there is no evident asymmetric
effect between positive and negative surprises. Only EPS information released during
the last day seems relevant. News stories from StreetAccount are slightly more useful
to explain market reactions than Reuters news, and variables based on day-to-day
variations of the rate of information arrival are the most useful. Earnings is the most
important news topic. Retail investors' attention during the last week, captured by
Google Trends, is positively linked with volatility.

In order to test the MDH, we perform two different OLS regressions with HAC standard
errors: one for the HAR-TCJ model and one for the HAR-TCJN model employing as news
variables only the previously selected ones, and compare the estimated autoregressive
coefficients between the two models. Table 1 presents the estimation results
(cross-sectional averages) for $\beta_0$, $\beta_d$, $\beta_w$, $\beta_m$ and $\beta_j$
for both models and their variation after the inclusion of news. The coefficients are
all positive and, with the exception of $\beta_m$, their value is lower for the
HAR-TCJN model, while the intercept $\beta_0$ is higher for the HAR-TCJN model. These
variations highlight the relevance of news as a source of additional information, which
affects the estimated autoregressive coefficients. The results are consistent with the
MDH.


Table 1

             log HAR-TCJ     log HAR-TCJN    $\Delta\beta$
$\beta_0$    0.34 (2.50)     0.65 (3.35)      0.31
$\beta_d$    0.26 (2.64)     0.23 (2.52)     -0.03
$\beta_w$    0.46 (2.88)     0.38 (2.86)     -0.08
$\beta_m$    0.20 (2.13)     0.22 (2.26)      0.02
$\beta_j$    0.18 (0.83)     0.14 (0.52)     -0.04

Estimated (cross-sectional average) $\beta_0$, $\beta_d$, $\beta_w$, $\beta_m$ and
$\beta_j$ and their t-statistics in brackets for the log HAR-TCJ and the log HAR-TCJN
models, and variation of the coefficients between the two models. OLS regression with
HAC standard errors, using as explanatory variables for the HAR-TCJN model the
regressors of the log HAR-TCJ plus the news variables selected by LASSO.


Evaluating the Forecasting Performance Improvement


Using a rolling window of 1000 observations, we iteratively estimate the HAR-TCJ and
the HAR-TCJN models and apply the estimated coefficients to the information available
on the day following the last day used for estimation, obtaining the one-step-ahead
forecast of realized volatility. The forecasting performance of the two models is
compared with five metrics: mean absolute error (MAE), mean square error (MSE), $R^2$ of
Mincer-Zarnowitz forecasting regressions ($R^2$ MZ), heteroskedasticity-adjusted root
mean square error (HRMSE) and QLIKE.
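
The rolling scheme and some of the loss functions can be sketched as below. The regressors are assumed to be already lagged, i.e. row t of X holds the information available for forecasting y[t]; the formulas used for HRMSE and QLIKE are common definitions that the text does not spell out, so they are assumptions, and the window of 100 and the simulated data are illustrative only.

```python
import numpy as np


def rolling_one_step_forecasts(X, y, window):
    """For each t >= window, fit OLS on rows t-window..t-1 and forecast y[t] from X[t]."""
    preds = []
    for t in range(window, len(y)):
        design = np.column_stack([np.ones(window), X[t - window:t]])
        coef, *_ = np.linalg.lstsq(design, y[t - window:t], rcond=None)
        preds.append(np.concatenate(([1.0], X[t])) @ coef)
    return np.array(preds)


def forecast_metrics(actual, forecast):
    """MAE, MSE, HRMSE and QLIKE (assumed definitions; inputs must be positive)."""
    err = actual - forecast
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": np.mean(err ** 2),
        "HRMSE": np.sqrt(np.mean((err / actual) ** 2)),
        "QLIKE": np.mean(np.log(forecast) + actual / forecast),
    }


# Hypothetical positive series driven by one lagged regressor.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.5, size=300)
y = 0.2 + 0.8 * x + rng.normal(0.0, 0.05, size=300)

pred = rolling_one_step_forecasts(x.reshape(-1, 1), y, window=100)
print(forecast_metrics(y[100:], pred))
```

Running the same function on the forecasts of two competing models gives the loss series that a test of equal predictive accuracy would then compare.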

Table 2 reports the cross-sectional mean of the metrics over all assets. For all
metrics except the $R^2$ MZ, it reports in brackets the percentage of assets for which
the Diebold-Mariano test [5] rejects, at the 5% significance level, the null hypothesis
of equal predictive accuracy in favor of each model; for the $R^2$ MZ, it reports in
brackets the percentage of assets for which the metric is higher (i.e. a superior
predictive accuracy) for each model. The HAR-TCJN model yields on average lower MAE,
HRMSE and QLIKE and a higher $R^2$ MZ. The average MSE is instead lower for the HAR-TCJ
model. The HAR-TCJN model delivers a better forecasting power, which is statistically
significant for a percentage of stocks ranging, depending on the metric, from 11.24% to
82.02%. The test never signals a statistically significant superior predictive accuracy
of the HAR-TCJ model.


Table 2

            log HAR-TCJ       log HAR-TCJN
MAE         0.96 (0.00%)      0.95 (26.97%)
MSE         33.82 (0.00%)     34.30 (11.24%)
$R^2$ MZ    0.50 (26.97%)     0.51 (73.03%)
HRMSE       0.92 (0.00%)      0.82 (59.55%)
QLIKE       1.45 (0.00%)      1.44 (82.02%)


One-step-ahead MAE, MSE, $R^2$ MZ, HRMSE and QLIKE of the log HAR-TCJ and the log
HAR-TCJN models (cross-sectional average). In brackets, for each metric except for
$R^2$ MZ: percentage of assets for which the Diebold-Mariano test rejects, at the 5%
significance level, the null hypothesis of equal predictive accuracy in favor of that
model; for $R^2$ MZ: percentage of assets for which the metric is higher for that model.
News Measures and Intraday Jumps


We identify the precise intraday intervals at which jumps occur, relying on the
procedure of [2] and using the corrected threshold multipower variation measures of
[4]. Indicators are selected using the Elastic Net ([9]) in a logistic regression with
the occurrence of jumps (1 for occurrence, 0 otherwise) as the dependent variable. [8]
point out that logistic regression is often plagued by degeneracies when the number of
covariates p is greater than the number of observations N and exhibits wild behavior
even when N is close to p; the Elastic Net penalty alleviates these issues, and
regularizes and selects variables as well.
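
A minimal sketch of an Elastic Net logistic regression with p > N is given below, fitted by proximal gradient descent on simulated data. The objective, the step size, the penalty parameters lam and alpha and the data are all assumptions made for illustration and do not reproduce the authors' estimation.

```python
import numpy as np


def elastic_net_logistic(X, y, lam, alpha=0.5, step=0.1, n_iter=2000):
    """Minimise mean logistic loss + lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).

    The intercept b0 is not penalised. Returns (b0, beta).
    """
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(b0 + X @ beta)))
        # Gradient of the smooth part (logistic loss plus ridge term)
        grad_b0 = np.mean(prob - y)
        grad = X.T @ (prob - y) / n + lam * (1 - alpha) * beta
        b0 -= step * grad_b0
        # Gradient step followed by soft-thresholding (proximal map of the L1 term)
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * alpha, 0.0)
    return b0, beta


# Simulated jump indicator: only the first of 300 covariates is informative.
rng = np.random.default_rng(1)
n, p = 200, 300
X = rng.standard_normal((n, p))
prob_jump = 1.0 / (1.0 + np.exp(-(-1.0 + 3.0 * X[:, 0])))
y = (rng.uniform(size=n) < prob_jump).astype(float)

b0, beta = elastic_net_logistic(X, y, lam=0.2, alpha=0.5)
print("non-zero coefficients:", np.flatnonzero(beta))
```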

Results tell us that macro announcements, especially FOMC rate decisions, as 
well as news stories, independently of their topic, cause jumps, and that all lagged 
intervals used to aggregate information are relevant. 


6 Concluding Remarks 


Our empirical results validate the Mixture of Distributions Hypothesis, showing the
relevance of news as an important driver of volatility. Macro news and EPS are the most
influential, followed by news stories and Google Trends. Aggregating information over
different time horizons is important. By including news-based information, we are able
to improve volatility forecasting. Macro announcements, especially FOMC rate decisions,
and news stories are related to intraday jumps, which can follow immediately or with a
delay ranging from a few minutes to one hour.
Growing happiness: a model-based tree


Carmela Cappelli, Rosaria Simone and Francesca di Iorio 


Abstract Tree based methods in statistics are gaining a renewed interest in the 
Big Data era since they entail effective interpretation of results. In this setting, we 
apply a model-based technique to build trees for ordinal responses relying on a class 
of mixture models whose characteristic feature is the probabilistic specification of 
uncertainty. An application to the perception of happiness shows that the integration 
of tree methods with the chosen modelling boosts cluster analysis of respondents. 

Abstract In the Big Data era, tree-based statistical methods gain great relevance
since they allow an effective interpretation of results. In this setting, we consider a
technique to grow trees for ordinal responses based on a class of statistical models
whose added value is the probabilistic specification of uncertainty. The validity of
the approach and its application to clustering are discussed on the basis of a survey
on the perception of happiness.

Keywords: Tree based methods; Uncertainty; Ordinal Responses 


1 Motivations 


Within the consolidated literature on ordinal data analysis [1], an alternative
approach is based on CUB models [4, 5], whose rationale is that discrete choices arise
from a psychological process that involves two components: a personal feeling and an
inherent uncertainty. The effectiveness of this paradigm improves with the inclusion of
explanatory covariates for the parameters, leading to CUB regression models. Lately,
tree-based methods [2] have gained widespread popularity because they are a simple yet
powerful data analysis tool, particularly useful to analyze large data sets
characterized by both qualitative and quantitative covariates. Key advantages of the
tree methodology are the automatic selection of the most relevant covariates and its
interpretability.

Along the lines of [6], the authors of [3] proposed a method to grow model-based trees
in which every node is associated with a CUB regression model. The terminal nodes of
the tree then identify alternative profiles of respondents based on the covariate
values and classified according to levels of uncertainty and feeling. Additionally, the
similarity of the clusters so determined can be investigated by exploiting a graphical
feature of the chosen modelling [3].

The paper is organized as follows: first we recall the basics of the chosen
model-based procedure (section 2). Then, in section 3, we present an application to a
data set investigating the perception of happiness. The whole analysis has been run
within the free R environment: the code is available upon request from the authors.


2 Background and Methodology 


Trees have proven to be a useful tool for high dimensional data analysis, able to
capture nonlinear structures and interactions. Growing trees [2] for a response
variable (either continuous or categorical) relies on a top-down partitioning algorithm
known as recursive binary splitting: a splitting criterion selects, at each tree node
(subset of observations), the best split, i.e. binary division, of the current node,
based on a set of explanatory variables. At each tree node $t$, given an impurity
measure $i(t)$ that assesses the homogeneity of node $t$, the algorithm chooses the
split $s^*$ that induces the highest decrease in impurity with respect to the child
nodes of $t$ ($t_l$ and $t_r$, respectively):


$$s^* = \arg\max_s \Delta I(s,t), \qquad \Delta I(s,t) = i(t) - [\, i(t_l) \, p_l + i(t_r) \, p_r \,],$$


where $p_l$ and $p_r$ represent the node weights. Once a node is partitioned, the
splitting process is recursively applied to each child node until either they reach a
minimum size or no further reduction of impurity can be achieved.
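
As a concrete instance of the criterion, the sketch below searches the best binary split of a numeric covariate for a categorical response. The Gini index is used as impurity i(t) and the node weights are the shares of observations sent to each child; both are assumptions, since the text leaves the impurity measure generic.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(x, y):
    """Return (threshold, decrease) maximising i(t) - [i(t_l) p_l + i(t_r) p_r].

    Candidate splits send observations with x <= threshold to the left child.
    """
    parent = gini(y)
    best = (None, 0.0)
    for c in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        p_l, p_r = len(left) / len(y), len(right) / len(y)
        decrease = parent - (gini(left) * p_l + gini(right) * p_r)
        if decrease > best[1]:
            best = (c, decrease)
    return best


# Hypothetical node with six observations and two classes.
x = [1, 2, 3, 4, 5, 6]
y = ["a", "a", "a", "b", "b", "b"]
print(best_split(x, y))
```

Applying best_split recursively to the two children, until a minimum size is reached or the decrease is zero, gives the recursive binary splitting described above.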

The tree methodology based on CUB models (CUB REgression MOdel Trees, CUBREMOT for
short) has been advanced in [3]. In a nutshell, the CUB models paradigm [4] designs the
data generating process yielding an ordinal evaluation out of the latent perception
as the combination of a feeling component (which drives substantial likes and
agreement), shaped by a shifted Binomial distribution, and an uncertainty component,
which is assigned a discrete Uniform distribution. Denoting by $R_i \in \{1, \dots, m\}$
the score assigned by the $i$-th respondent to a given item of a questionnaire, we say
that $R_i$ is a CUB distributed random variable with uncertainty parameter $\pi_i$ and
feeling parameter $\xi_i$ (for short, $R_i \sim CUB(\pi_i, \xi_i)$) if:


$$\Pr(R_i = r \mid \pi_i, \xi_i) = \pi_i \binom{m-1}{r-1} \xi_i^{m-r} (1 - \xi_i)^{r-1} + (1 - \pi_i) \frac{1}{m}, \qquad r = 1, \dots, m.$$


In particular, the mixing proportion $\pi_i$ is an indirect measure of heterogeneity
while $1 - \xi_i$ indicates a positive tendency in the data w.r.t. the topic under
investigation. Explanatory variables may be included in the model in order to relate
feeling and/or uncertainty to respondents’ profiles. Then, consider a CUB regression
model with a logit link between parameters and a dichotomous factor $D$:


$$\text{logit}(\pi_i) = \beta_0 + \beta_1 D_i, \qquad \text{logit}(\xi_i) = \gamma_0 + \gamma_1 D_i. \qquad (1)$$


If no covariate is considered for either feeling or uncertainty, then $\pi_i = \pi$ and
$\xi_i = \xi$ are constant among subjects. Estimation of CUB models relies on likelihood
methods and on the implementation of the Expectation-Maximization (EM) algorithm for
mixtures.

Since the process of descending son nodes from a father node is a binary splitting,
the starting point to grow a CUBREMOT is the selection of a set of explanatory variables
to be sequentially transformed and associated with a set of dichotomous factors. Then,
for a given $k \geq 1$ and a dichotomous variable $D$, a CUB regression fit to the
$k$-th node provides it with a $CUB(\hat\pi_k, \hat\xi_k)$ distribution whose
log-likelihood at the final ML estimates is $\ell_n(\hat\pi_k, \hat\xi_k)$. Then, the
split induced by $D$ associates the left son node $2k$ (right son node $2k+1$, resp.)
with the conditional distribution $R \mid D = 0$ ($R \mid D = 1$, respectively). The
proposal in [3] relies on a splitting criterion based on the log-likelihood increment
from the father to the sons level for each possible split, and at the given step
chooses the one that maximizes the deviance:


$$\Delta \ell_k = [\, \ell_{n_0}(\hat\pi_{2k}, \hat\xi_{2k}) + \ell_{n_1}(\hat\pi_{2k+1}, \hat\xi_{2k+1}) \,] - \ell_n(\hat\pi_k, \hat\xi_k) \qquad (2)$$


(here, $n_i$ denotes the size of the sub-sample conditional on $D = i$, $i = 0, 1$).
Finally, a node is declared terminal if none of the available covariates is significant
(neither for feeling nor for uncertainty), or if the sample size is too small to allow
a CUB model fit. Figure 1 shows the formal configuration of the split at node $k$.
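
The splitting criterion (2) can be sketched as follows for a single dichotomous factor. The sketch assumes that the CUB regression with D acting on both parameters is equivalent to fitting a separate CUB model without covariates in each son, uses a basic EM iteration with a fixed number of steps, and does not implement the significance check that declares a node terminal; the ratings and factor values are invented.

```python
from math import comb, log


def cub_pmf(r, m, pi, xi):
    """Pr(R = r) for a CUB(pi, xi) variable on 1..m."""
    return pi * comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1) + (1 - pi) / m


def clamp(value, low=1e-6, high=1 - 1e-6):
    """Keep a parameter strictly inside (0, 1) for numerical safety."""
    return min(max(value, low), high)


def fit_cub(ratings, m, n_iter=300):
    """EM estimates (pi, xi) of a CUB model without covariates, plus the log-likelihood."""
    pi = 0.5
    xi = clamp((m - sum(ratings) / len(ratings)) / (m - 1))
    for _ in range(n_iter):
        # E-step: probability that each rating comes from the feeling component
        tau = [1 - (1 - pi) / m / cub_pmf(r, m, pi, xi) for r in ratings]
        # M-step: mixing weight and shifted Binomial parameter (weighted mean)
        pi = clamp(sum(tau) / len(tau))
        weighted_mean = sum(t * r for t, r in zip(tau, ratings)) / sum(tau)
        xi = clamp((m - weighted_mean) / (m - 1))
    loglik = sum(log(cub_pmf(r, m, pi, xi)) for r in ratings)
    return pi, xi, loglik


def split_deviance(ratings, d, m):
    """Log-likelihood of the two sons (D = 0, D = 1) minus that of the father node."""
    left = [r for r, di in zip(ratings, d) if di == 0]
    right = [r for r, di in zip(ratings, d) if di == 1]
    return fit_cub(left, m)[2] + fit_cub(right, m)[2] - fit_cub(ratings, m)[2]


# Hypothetical node: ratings on m = 7 categories and two candidate factors.
m = 7
low = [1, 1, 1, 2, 2, 2, 2, 3, 3, 4] * 3
high = [4, 5, 6, 6, 6, 7, 7, 7, 7, 7] * 3
ratings = low + high
informative = [0] * 30 + [1] * 30   # separates low from high ratings
alternating = [0, 1] * 30           # unrelated to the ratings by construction

print("deviance, informative factor:", split_deviance(ratings, informative, m))
print("deviance, alternating factor:", split_deviance(ratings, alternating, m))
```

At each step the factor with the largest deviance would be chosen for the split, which in this toy node is the informative one.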


$R \sim CUB(\hat\pi_k, \hat\xi_k)$


$(R \mid D = 0) \sim CUB(\hat\pi_{2k}, \hat\xi_{2k})$        $(R \mid D = 1) \sim CUB(\hat\pi_{2k+1}, \hat\xi_{2k+1})$


Fig. 1 CUBREMOT: split at node k
3 Application


The focus of the present contribution is the derivation of a CUBREMOT for measurements
of perceived happiness collected at the University of Naples Federico II in December
2014 (the data set is available at:
http://www.labstat.it/home/wp-content/uploads/2015/09/relgoods.txt). Every participant
was asked to rate the quality and the importance attributed to selected relational goods
on an $m = 10$ point ordinal scale (1 = “Never”, “Not at all good”, to 10 = “Always”,
“A lot”, “Absolutely good”), and to self-evaluate his/her happiness by marking a sign
along a horizontal line (with the left-most bound standing for “extremely unhappy”, and
the right-most one for “extremely happy”). This continuous measurement has been
uniformly discretized into an ordinal variable Happiness over $m = 10$ categories in
order to allow direct comparisons with other questionnaire items. For illustrative
purposes, the case study considers only a few dichotomous characteristics of the
subjects: Gender (1 for women), the marital status Married (1 for married), and the
smoking habit Smoke (1 for smokers). The association between happiness and relationships
with both Friends and Parents is also considered (the latter measured by a proxy
quantifying the time spent with them).

The final CUBREMOT is displayed in Figure 2 (terminal nodes are squared): at each split,
the value of the deviance splitting criterion (2) and the sample sizes are reported.
Noticing that in this context the estimated $1 - \hat\xi$ is a direct indicator of
happiness, overall people are fairly happy ($1 - \hat\xi_1 = 0.653$) and evaluations are
affected by a modest level of indecision ($1 - \hat\pi_1 = 0.398$). Some main comments
about the CUBREMOT classification can be summarized as follows:


• The happiest group of respondents corresponds to males giving a low evaluation of
relationships with parents (Parents $\leq$ 5) and an extremely positive perception of
friendship (Friends $\geq$ 9), with $1 - \hat\xi = 0.823$. However, these responses are
affected by a non-negligible uncertainty, indicating that there could be unobserved
factors, such as response style effects. A comparable level of happiness is observed
among married people who speak very often with their parents and who rate their
relationships with friends as being of high quality (with an estimated happiness of
$1 - \hat\xi = 0.808$). In this case, this index can also be considered a precise
classification measure since the node is characterized by a low uncertainty
($1 - \hat\pi = 0.258$). In addition, friendship is recognized as playing a major role
in the perception of happiness, especially among married people (perceived happiness
increases when descending from node 15). Instead, it is noticeable that marital status
does not come into play among those not having a good relationship with their parents
(from node 6 downward), whereas married people are happier among those evaluating their
relationships with parents positively ($1 - \hat\xi_{14} = 0.663$ against
$1 - \hat\xi_{15} = 0.744$).

• The unhappiest are those with a very poor quality of relationships both with parents
and with friends ($1 - \hat\xi_4 = 0.147$ and a fairly high level of indecision:
$1 - \hat\pi_4 = 0.641$). More specifically, among respondents assessing their
relationships with friends as unsatisfactory (Friends $\leq$ 5), one observes a sharp
improvement in perceived happiness, from $1 - \hat\xi_4 = 0.147$ to
$1 - \hat\xi_5 = 0.501$, as soon as one switches from giving a very low judgment
(Friends $\leq$ 3) to a medium-low score (Friends = 4, 5). In addition, the split of
node 2 into nodes 4 and 5 has indeed identified more homogeneous groups
($1 - \hat\pi_2 = 0.705$ against $1 - \hat\pi_4 = 1 - \hat\pi_5 = 0.641$).

• By looking at nodes 26 and 27, 112 and 113, and nodes 58 and 59, it can be inferred
that happiness hinges more on friendships for men than for women. For instance, among
the unmarried respondents who are satisfied with their family bonds (Parents $\geq$ 6)
and give the highest evaluation of friendships (Friends = 10), females are slightly
unhappier than males ($1 - \hat\xi_{59} = 0.681$ against $1 - \hat\xi_{58} = 0.733$).


Fig. 2 CUBREMOT for Happiness


As a by-product, a cluster analysis of CUBREMOT nodes can be performed once
they are represented as points in the parameter space with coordinates given by
the corresponding uncertainty $1 - \pi$ and feeling $1 - \xi$. Figure 3 shows the
scatterplot of the first 59 nodes for Happiness: three clusters, highlighted by
different symbols, are identified using a simple k-means algorithm with k = 3.
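
As a rough illustration of this clustering step, the sketch below runs a plain k-means (Lloyd's algorithm) on points given as (uncertainty, feeling) pairs. The six coordinates are hypothetical placeholders, not the estimated node parameters, and the deterministic initialisation is a convenience of this sketch rather than a choice documented in the text.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Deterministic start (an assumption of this sketch): take k points
    # evenly spaced through the input as initial centroids.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each node goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical (uncertainty, feeling) coordinates for six nodes.
nodes = [(0.64, 0.15), (0.62, 0.20), (0.26, 0.70),
         (0.30, 0.74), (0.70, 0.50), (0.72, 0.55)]
labels, centroids = kmeans(nodes, k=3)
```

With real node estimates, a randomised initialisation with several restarts would usually be preferable to the fixed start used here.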




Fig. 3 CUBREMOT nodes in CUB parameter space 
Inequalities in access to job-related learning
among workers in Italy: evidence from the Adult
Education Survey (AES)


Differences in the training of adult workers
in the results of the Adult Education Survey (AES)


Paolo Emilio Cardone¹


Abstract Equitable access to adult learning for all is a goal for European education,
training and employment policies. In particular, all workers should be able to
acquire, update and develop their skills over their lifetime. How is it possible to
improve access to learning for older workers? This report provides a statistical
picture of older workers' participation in job-related training in Italy, investigating
its variability and relevant inequalities. The analysis is carried out using the Italian
AES, provided by Eurostat, which analyses adults’ learning activities and distinguishes
formal, non-formal and informal learning. Using a logistic regression model it is
possible to estimate the learning-age gap between those aged under and over 50 years
more accurately. Overall, the data confirm the existence of strong inequalities in
access to job-related learning among workers.


Abstract Longer life expectancy and demographic change make it necessary to work
longer and to sustain a workforce that is skilled, adaptable to change and
competitive. Equitable access to learning pathways for all workers, particularly
older ones, is one of the main goals of the European Commission for education,
training and employment policies. In particular, all workers should be able to
acquire, update and develop their skills over their working life. How is it possible
to improve access to continuing training for older workers? This contribution
provides a statistical picture of adult workers' participation in job-related
training programmes in Italy, investigating its variability and the most relevant
inequalities. The analysis was carried out using the Italian data from the AES
survey, conducted by Eurostat. It analyses adults' training activities,
distinguishing between formal education, non-formal education and informal
learning. Using a logistic regression model it is also possible to estimate the gap
between employed people over and under 50 more precisely. Overall, the data confirm
the existence of strong inequalities among workers in access to job-related training
programmes.


Key words: Age management; Adult education; Lifelong learning; Logistic 
regression model. 


1 Introduction 


Demographic ageing is an irreversible process. The direct effect of population
ageing is the increasing share of elderly people of retirement age, compared
to the decreasing share of young people.

Furthermore, the European Commission 2012 Ageing Report suggests that
population ageing has also been affecting the age structure of the working-age
population. This is extremely important in the overall context of the labour force in
the EU (particularly in Italy). On the labour market, the proportion of jobs that
require medium and high-level qualifications is expected to increase. However, there
is still an extremely high number of people of working age in Europe who have either
low or no qualifications.

The nature of jobs is changing, necessitating changes in the skills that are 
required of workers and adapting lifelong learning systems to the needs of an ageing 
workforce. The recent crisis has also highlighted the importance of education and
training at all stages of life, in particular for older adults to avoid unemployment,
vindicating the messages that “it is never too late to learn” and that learning must
be for all. This requires older people to maintain and update the skills they have,
particularly in relation to new technologies. Continuous learning and development of 
an ageing workforce are important for employers’ survival in competitive markets, as 
well as for maintaining older people’s employability. 

Equitable access to adult learning for all is a goal for European education, 
training and employment policies. In particular, all workers should be able to 
acquire, update and develop their skills over their lifetime. However, despite the 
increasing need for learning later in life, participation and access to learning 
decrease with age. How is it possible to improve access to learning for older
workers? This report provides a statistical picture of older workers' participation in
job-related training in Italy, investigating its variability and the relevant
inequalities due to key factors such as individual characteristics, jobs and
workplaces.
2 Data and methods


In order to achieve this goal, the analysis is carried out using microdata from the
second and latest wave of the Italian Adult Education Survey (AES-2011), provided by
Eurostat. The survey analyses the learning activities of adults and distinguishes
between formal, non-formal and informal learning, which takes place inside or
outside the workplace. It investigates adult participation in training in depth and
includes a sample of 11,500 individuals, 6,000 of whom are workers (once weighted,
they correspond to 22 million, the number of workers in Italy).

Regular participation in learning activities does not include taking part in
formal training only, but also learning in non-formal and informal settings.
In particular, informal learning plays a greater role for older employees than formal
learning because it facilitates the transfer of knowledge and know-how between
generations, allows practical skills to be gained quickly and fosters inclusion,
particularly of older workers, within networks of relationships.

The organizing concept of the CLA (Classification of Learning Activities)
is based on 3 broad categories: Formal Education (F), Non Formal Education (NF)
and Informal Learning (INF). It is possible to classify all learning activities into
these 3 categories using some general concepts and definitions:

Lifelong Learning (LLL) is defined as encompassing “all learning activity
undertaken throughout life, with the aim of improving knowledge, skills and
competences, within a personal, civic, social and/or employment related
perspective.”

Formal Education is defined as “education provided in the system of schools, colleges,
universities and other formal educational institutions that normally constitutes a
continuous ‘ladder’ of full-time education for children and young people, generally
beginning at age of five to seven and continuing up to 20 or 25 years old.” Formal
education refers to institutionalised learning activities that lead to a learning
achievement that can be positioned in the National Framework of Qualifications
(NFQ).

Non Formal Education is defined as “any organised and sustained
educational activities that do not correspond exactly to the above definition of formal
education. Non-formal education may therefore take place both within and outside
educational institutions, and cater to persons of all ages. Non formal education
programmes do not necessarily follow the ‘ladder’ system, and may have a differing
duration.” Non-formal education refers to institutionalised learning activities which
are not part of the NFQ.

Informal Learning is defined as “...intentional, but it is less organised and
less structured ....and may include for example learning events (activities) that occur
in the family, in the work place, and in the daily life of every person, on a
self-directed, family-directed or socially directed basis.” Informal learning
activities are not institutionalised.
The National Framework of Qualifications (NFQ) is defined as “the single,
nationally and internationally accepted entity¹, through which all learning
achievements may be measured and related to each other in a coherent way and
which defines the relationship between all education and training awards”.
In summary, the process of allocating education and learning to the
broad categories is presented in the decision-making flowchart shown in Figure 1:


Figure 1 — Allocation of learning activities according to the 3 broad categories 


As shown in Table 1, descriptive analysis reveals strong inequalities between workers
under and over 50 for all three broad categories.


Table 1 — Learning activities participation according to the 3 broad categories (% values, 
Source: own elaboration on AES data


¹ The entity can take the form of an organization/body, or regulatory document. It stipulates
the qualifications and the bodies that provide or deliver the qualification (awarding bodies)
that are part of the National Framework of Qualifications.
Using multivariate analysis (logistic regression models with Stata software) it is
possible to estimate the learning-age gap between those aged under and over 50
years more accurately. The model has been developed for employed adults only and
includes, first of all, adults’ socio-demographic characteristics (age, gender and
citizenship) and, secondly, occupation and enterprise size.

In order to achieve this goal, we have used “Learning” as the dependent variable
(weighted model). Learning = 1 if the worker has participated in at least one training
activity (formal, non-formal or informal). Concretely, in our study the following
variables are considered:


Gender. Categorical. Dummy variable: Female, Male (reference cat.). 
Citizen. Categorical. Three values: Italian citizenship (reference cat.), other
EU citizenship, non-EU citizenship.

JobISCO. Categorical. Nine levels. Elementary occupations (reference 
cat.), Managers, Professionals, Technicians and associate professionals, 
Clerical support workers, Service and sales workers, Skilled agricultural, 
forestry and fishery workers, Craft and related trades workers, Plant and 
machine operators, and assemblers. 

Sizefirm. Categorical. Four intervals. From 1 to 10 (micro, reference cat.), 
between 11 and 49 (small), between 50 and 249 (medium) and more than 
250 (large). 

Age. Dummy variable: Over 50 (reference cat.), Under 50. 


Table 2 — Logistic regression models 


Variable (reference category)      Category                                              Beta    Odds ratio   Sig.

Gender (Male)                      Female                                               -0.10    0.91         0.174

Citizen (Italian)                  EU                                                   -0.58    0.56         0.025
                                   Extra EU                                             -0.56    0.57         0.001

Size firm (Micro, 1-10)            Small (11-49)                                         0.19    1.21         0.019
                                   Medium (50-249)                                       0.48    1.62         0.000
                                   Large (250+)                                          0.59    1.81         0.000

Job ISCO (Elementary occupations)  Managers                                              1.24    3.44         0.000
                                   Professionals                                         2.24    9.44         0.000
                                   Technicians and associate professionals               1.53    4.61         0.000
                                   Clerical support workers                              0.96    2.61         0.000
                                   Service and sales workers                             0.66    1.94         0.000
                                   Skilled agricultural, forestry and fishery workers    0.84    2.32         0.002
                                   Craft and related trades workers                      0.34    1.40         0.018
                                   Plant and machine operators, and assemblers           0.55    1.74         0.000

Age (Over 50)                      Under 50                                              0.20    1.22         0.009

Intercept                                                                               -0.65    0.52         0.000


Source: own elaboration on AES data 
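
To make the reading of Table 2 concrete, the short sketch below converts two of the printed coefficients into an odds ratio and into predicted probabilities. It only uses the rounded values shown in the table, so results can differ slightly from the published odds ratios; the function and variable names are illustrative and do not come from the original Stata analysis.

```python
import math

# Coefficients (log-odds) as printed in Table 2, rounded to two decimals.
intercept = -0.65      # male, Italian, micro firm, elementary occupation, over 50
beta_under50 = 0.20    # effect of being under 50 relative to over 50

def odds_ratio(beta):
    """Odds ratio associated with a logistic regression coefficient."""
    return math.exp(beta)

def probability(logit):
    """Inverse logit: turns a linear predictor into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

or_under50 = odds_ratio(beta_under50)
# Predicted participation probability for the reference profile (over 50).
p_ref = probability(intercept)
# Same profile, but under 50.
p_under50 = probability(intercept + beta_under50)

print(f"odds ratio under 50 vs over 50: {or_under50:.2f}")
print(f"P(Learning=1), reference profile over 50: {p_ref:.3f}")
print(f"P(Learning=1), same profile under 50:     {p_under50:.3f}")
```

The comparison of the two probabilities shows why the value 1.22 should be read as a ratio of odds and not as a probability or a ratio of probabilities.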
3 Conclusions


One principal finding of the analysis is that the odds of participating in
training are 1.22 times higher for people under 50 than for those aged 50 and over
(Table 2). Secondly, women appear slightly less likely to take part in training
than men, although this difference is not statistically significant (p = 0.174).

Overall, the data confirm the existence of strong inequalities in access to
job-related learning among workers: foreign citizens, workers in micro and small
enterprises and those in lower-skilled occupations participate in job-related
learning to a much lower extent.

This requires policy attention, to increase the focus on job-related training as
part of active labour market policies and to prevent skills obsolescence. In addition,
it is important to develop a “learning culture”. This is a key factor for increasing
the productivity of older workers, for example by increasing their capacity to deal
with technological change (“it is never too late to learn”).

However, it will be crucial to increase the level of continuous vocational 
training for all workers in future. 

This is (or should be) the real challenge. 
Signal detection in high energy physics via a
semisupervised nonparametric approach*


Detection of a physical signal through a
semisupervised nonparametric approach


Alessandro Casa and Giovanna Menardi 


Abstract In particle physics, the task of identifying a new signal of interest, to be
discriminated from the background process, can in principle be formulated as a
clustering problem. However, while the signal is unknown, and usually even missing,
the background process is known and always present. Thus, available data have
two different sources: an unlabelled sample, which might include observations from
both processes, and an additional, labelled sample from the background only. In
this context, semisupervised techniques are particularly suitable to discriminate the
two class labels; they lie between unsupervised and supervised ones, sharing some
characteristics of both approaches. In this work we propose a procedure where
additional information, available on the background, is integrated within a
nonparametric clustering framework to detect deviations from known physics. We also
propose a variable selection procedure that allows working on a reduced subspace.
Abstract In particle physics, the search for a signal of interest, which manifests
itself as a deviation from the background process, can be formulated as a
clustering problem. However, while the presence of the signal is not certain, that
of the background, which represents a known process, is. In empirical analyses, one
has not only unlabelled data, which might contain signal, but also a sample of
labelled data coming from the background process alone. It therefore makes sense to
adopt a semisupervised approach, which lies halfway between supervised and
unsupervised methods. In this work we propose a procedure that integrates the
additional available information into nonparametric clustering techniques to detect
deviations from existing physical theories. A variable selection procedure that
allows operating on a reduced subspace is also proposed.
Key words: high energy physics, nonparametric clustering, semisupervised
classification


1 Introduction 


Since the early Sixties, the Standard Model has represented the state of the art
in High Energy Physics. It describes how the fundamental particles interact with
each other and with the forces between them, giving rise to the matter in the
universe. Despite its empirical confirmations, there are indications that the Standard
Model does not by itself complete our understanding of the universe. Model-independent
searches aim to explain the shortcomings of this theory by empirically looking
for any possible signal which behaves as a deviation from the background process,
representing, in turn, the known physics.

The considered problem can be recast as a classification problem, although
of a very peculiar nature. While the background process is known and a sample
of virtually infinite size can be drawn from it, the signal process is unknown,
possibly even missing. Available data have, consequently, two different sources: a
first, labelled sample from the background class only, and a second, unlabelled sample
which might include observations from the signal. A semisupervised perspective [2]
shall then be adopted, either by relaxing assumptions of supervised methods, or by
strengthening unsupervised clustering structures through the inclusion of additional
information available from the labelled data.

In [5], the problem has been addressed by building on a suitable adaptation of
parametric density-based clustering to the semisupervised framework, according to the
same logic as anomaly detection tasks. In this work we follow a similar route, yet in
a nonparametric guise. Such a formulation appears consistent with the physical notion
of signal, i.e. a new particle would manifest itself as a peak emerging from the
background process. Nonparametric (modal) clustering, in turn, draws a correspondence
between groups and the modal peaks of the density underlying the observed data.
Thus, the one-to-one relationship between clusters and modes of the distribution
would provide an immediate physical meaning to the detected clusters.

The main idea underlying this work is to semisupervise nonparametric cluster- 
ing by exploiting information available from the background process. Specifically, 
we tune a nonparametric estimate of the unlabelled data by selecting the smoothing 
amount so that the induced modal partition will classify the labelled background 
data as accurately as possible. As a side contribution we propose a variable selec- 
tion procedure, specifically conceived for this framework, linked to the concept of 
stability of the distribution underlying the data. 

We adopt the following notation: $\mathcal{X}_b = \{x_i\}_{i=1,\dots,n_b}$ denotes the
set of labelled data, supposed to be a sample of iid multidimensional observations
from the background distribution $f_b$. Since the background is known and well
explained by the existing physical theories, we may assume $n_b$ to be as large as
needed to estimate $f_b$ arbitrarily well. $\mathcal{X}_{bs} = \{x_i\}_{i=1,\dots,n_{bs}}$
has the same structure as $\mathcal{X}_b$, and denotes the unlabelled set of data,
assumed to be drawn from the distribution $f_{bs}$ underlying the whole process. We
assume that $f_{bs}$ and $f_b$ could be different just because of the presence of a
signal which features as a new mode of $f_{bs}$, not arising from $f_b$.


2 The statistical framework 


According to the nonparametric formulation of density-based clustering, the
observed data $\mathcal{X} = \{x_i\}_{i=1,\dots,n}$, $x_i = (x_{i1},\dots,x_{ij},\dots,x_{id})' \in \mathbb{R}^d$,
are supposed to be a sample from a random vector with unknown probability density
function $f$, whose modes are regarded as the archetypes of the clusters, in turn
represented by the surrounding regions. After building a nonparametric estimate
$\hat{f}$ of $f$, the identification of the modal regions may proceed in different
directions. One strand of methods looks for an explicit representation of the modes
of $f$ and associates each cluster to the set of points along the steepest ascent path
towards a mode, e.g. via the mean-shift algorithm. A second class of methods does not
explicitly attempt the task of mode detection but associates the clusters to
disconnected density level sets of the sample space, as the modes correspond to the
innermost points of these sets. See [4] for a review of these approaches.

Whatever direction is followed, any estimate of $f$ determines the modal structure
and hence the clustering. However, nonparametric density estimation is a critical
task, at least with respect to two aspects. First, the shape and the number of
modes of the density estimate depend on the setting of some smoothing parameter,
whatever estimator is chosen. Although this choice is not binding, in the rest of
the paper we focus on the specific case of the product kernel estimator:


$$\hat{f}(x; \mathcal{X}, h) = \frac{1}{n h^{d}} \sum_{i=1}^{n} \prod_{j=1}^{d} K\left(\frac{x_{j} - x_{ij}}{h}\right)$$


where $K$ is the kernel, usually a symmetric probability density function, and $h > 0$
is the bandwidth. A large bandwidth tends to oversmooth the density, possibly
obscuring its modal structure, while a small bandwidth favours the appearance of
spurious modes. How to set the amount of smoothing is then an issue to be addressed.
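
A minimal sketch of the product kernel estimator may help fix ideas. It assumes a Gaussian kernel and a single bandwidth shared by all coordinates; the function names are illustrative and the toy data are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the univariate kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def product_kernel_density(x, sample, h, kernel=gaussian_kernel):
    """Product kernel estimate at point x with common bandwidth h."""
    n, d = len(sample), len(x)
    total = 0.0
    for xi in sample:
        # Product over the d coordinates of the univariate kernels.
        prod = 1.0
        for j in range(d):
            prod *= kernel((x[j] - xi[j]) / h)
        total += prod
    return total / (n * h ** d)

# Toy one-dimensional sample with two well separated observations.
toy = [(-2.0,), (2.0,)]
small_h = (product_kernel_density((0.0,), toy, 0.5),
           product_kernel_density((2.0,), toy, 0.5))
large_h = (product_kernel_density((0.0,), toy, 3.0),
           product_kernel_density((2.0,), toy, 3.0))
```

In the toy sample, the small bandwidth leaves a deep valley between the two observations (two modes), while the large bandwidth puts more density at the midpoint than at the observations, merging them into a single mode.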

A second critical aspect related to density estimation, and worth accounting
for, is the dimensionality of the problem at hand. The curse of dimensionality is
known to have a strong impact on nonparametric density estimators and this explains 
a worsened behaviour of modal clustering for increasing d. In high dimensions, 
much of the probability mass flows to the tails of the density, possibly giving rise 
to the birth of spurious clusters and averaging away features in the highest density 
regions. Since a typical aspect of high dimensional data is the tendency to fall into 
manifolds of lower dimension, dimension reduction methods are often advisable. 
3 A nonparametric semisupervised approach


Our contribution, to include a source of supervision in nonparametric clustering, 
builds on the idea of exploiting available information on the known labelled pro- 
cess to aid the most critical aspects of the nonparametric framework, i.e. density 
estimation in high dimensions and selection of the smoothing amount. 

To address the former issue, here we propose a variable selection approach
specifically formulated for this context. The procedure is based on the idea that a
possible different behavior between $f_b$ and $f_{bs}$ shall be only due to the presence
of a signal of interest in $f_{bs}$; hence, a variable will be considered relevant if it
contains any trace of signal. The approach adopted here consists in repeatedly
comparing the estimates of multivariate marginal distributions of $f_b$ and $f_{bs}$, at
each step on a different, randomly selected subset of variables. In this way we operate
in lower dimensional spaces, with a gain in density estimation accuracy, while
accounting for relations among variables. The comparison is based on the
nonparametric statistic [3] to test the equality of two distributions. If a different
behavior is detected, the procedure updates a counter for the variables selected at
that step; at the end of the procedure the counter will indicate the relevance of each
single variable. The procedure allows for selecting a smaller subset of variables to
work with, leading both to interpretative and computational advantages.
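
The counting scheme described above can be sketched as follows. The two-sample test is passed in as a function: the `mean_shift_test` used here is only a crude placeholder based on coordinate-wise means and is not the nonparametric statistic of [3]. All names, the subset size, the number of iterations and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def mean_shift_test(sub_a, sub_b, threshold=4.0):
    """Placeholder two-sample check (NOT the statistic of [3]):
    flags a difference if any coordinate mean differs by more than
    `threshold` standard errors."""
    for j in range(len(sub_a[0])):
        a = [row[j] for row in sub_a]
        b = [row[j] for row in sub_b]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
        var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
        z = abs(mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5
        if z > threshold:
            return True
    return False

def variable_relevance(bkg, unl, differs, subset_size=2, n_iter=200, seed=0):
    """Count how often each variable belongs to a random subset on which
    the background and unlabelled samples are judged to differ."""
    rng = random.Random(seed)
    d = len(bkg[0])
    counter = Counter({j: 0 for j in range(d)})
    for _ in range(n_iter):
        subset = rng.sample(range(d), subset_size)
        sub_b = [[row[j] for j in subset] for row in bkg]
        sub_u = [[row[j] for j in subset] for row in unl]
        if differs(sub_b, sub_u):
            counter.update(subset)  # credit every variable in the subset
    return counter

# Hypothetical toy data: 4 variables, signal visible in variable 0 only.
rng = random.Random(1)
bkg = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(400)]
unl = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(400)]
for row in unl[:120]:
    row[0] += 4.0
relevance = variable_relevance(bkg, unl, mean_shift_test)
```

Variables that merely co-occur with a relevant one in a subset also receive some credit, which is why the final ranking, rather than the raw counts, is the quantity of interest.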

To address the second critical aspect discussed above, we propose a procedure
whose rationale is the following. We identify the modal partition of the unlabelled
data associated with the nonparametric estimate $\hat{f}_{bs}$ which guarantees the most
accurate classification of the labelled background observations. Given an estimate
$\hat{f}_b$ of $f_b$, supposed to be arbitrarily accurate due to our knowledge of the
background process, a partition $\mathcal{P}_b(\mathcal{X}_b)$ of the background data
remains associated. Then, we get multiple estimates
$\hat{f}_{bs}(\cdot; \mathcal{X}_{bs}, h_{bs})$ of $f_{bs}$ for $h_{bs}$ varying in a range
of plausible values. Each of these estimates identifies a partition
$\mathcal{P}_{bs}(\mathcal{X}_{bs})$ and, eventually, also a partition
$\mathcal{P}_{bs}(\mathcal{X}_b)$ of the background data, both defined by the modal
regions of $\hat{f}_{bs}$. The latter classification is obtained by assigning a
background observation to the cluster of $\hat{f}_{bs}$ for which its density is the
highest. $\mathcal{P}_{bs}(\mathcal{X}_b)$ is then compared with
$\mathcal{P}_b(\mathcal{X}_b)$ via the computation of some agreement index $I$. The
bandwidth $h_{bs}$ that maximizes $I$ is then selected to estimate $f_{bs}$ and identify
the ultimate partition $\mathcal{P}_{bs}(\mathcal{X}_{bs})$. The main steps of the
procedure are listed in Pseudo-algorithm 1.

From an operational point of view, to obtain partitions we use the clustering
method [1] and, as agreement index, the Adjusted Rand Index. Furthermore,
$\mathcal{P}_{bs}(\mathcal{X}_b)$ and $\mathcal{P}_b(\mathcal{X}_b)$ are not, in fact,
compared on the whole background sample $\mathcal{X}_b$ but on a number of different
bootstrap samples from $\mathcal{X}_b$; this allows us to get the empirical
distribution of the agreement index $I$ and obtain more reliable results.

Finally, we note that, although the background process is known and $\mathcal{X}_b$ is
arbitrarily large, the procedure presented above requires an estimate of the
background density $f_b$, i.e. a choice of the corresponding bandwidth. To this aim we
rely on the concept of stability of the density estimate and select the bandwidth that
minimizes the integrated squared distance among density estimates computed from
different samples drawn from the background process.
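Pseudo-algorithm 1 Semisupervised procedure for bandwidth selection
Denote with: $\mathcal{X}_b$ the background sample, $\mathcal{X}_{bs}$ the unlabelled
sample from the whole process; it is assumed that the dimensionality of both samples
has already been reduced via variable selection. Let $h_b$ be the background
bandwidth; $h_{bs}$ the whole process bandwidth (to be determined); $h_{grid}$ a grid of
plausible values for $h_{bs}$. Finally, let $\mathcal{P}(\mathcal{X})$ be a partition of
data $\mathcal{X}$ identified by the modal structure of density $f_{\mathcal{X}}$ and
$I(\mathcal{P}, \mathcal{Q})$ an agreement index between partitions $\mathcal{P}$ and
$\mathcal{Q}$.
Input: $\mathcal{X}_{bs}$, $\mathcal{X}_b$, $h_b$, $h_{grid}$.
 1: compute $\hat{f}_b(\cdot; \mathcal{X}_b, h_b)$;
 2: obtain $\mathcal{P}_b(\mathcal{X}_b)$;
 3: for $h$ in $h_{grid}$ do
 4:     compute $\hat{f}_{bs}(\cdot; \mathcal{X}_{bs}, h)$;
 5:     obtain $\mathcal{P}_{bs}(\mathcal{X}_b)$;
 6:     compute $I(\mathcal{P}_b(\mathcal{X}_b), \mathcal{P}_{bs}(\mathcal{X}_b))$.
 7: end for
 8: $h_{bs} = \arg\max_{h \in h_{grid}} I(\mathcal{P}_b(\mathcal{X}_b), \mathcal{P}_{bs}(\mathcal{X}_b))$
 9: compute $\hat{f}_{bs}(\cdot; \mathcal{X}_{bs}, h_{bs})$;
10: obtain $\mathcal{P}_{bs}(\mathcal{X}_{bs})$;
Output: $\mathcal{P}_{bs}(\mathcal{X}_{bs})$.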
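
The selection loop can be sketched in a few lines once the agreement index is available. The code below implements the Adjusted Rand Index from its contingency-table definition and picks the grid value with the highest agreement. The modal clustering step itself (method [1]) is not implemented: `labels_for` is a hypothetical callable standing in for "estimate the density with bandwidth h, then classify the background data", and the toy partitions are invented. The bootstrap step is omitted, and at least two observations are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same observations."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are identical and trivial
    return (sum_cells - expected) / (max_index - expected)

def select_bandwidth(h_grid, reference_labels, labels_for):
    """Return the bandwidth whose induced background partition agrees most
    with the reference background partition, plus all the scores."""
    scores = {h: adjusted_rand_index(reference_labels, labels_for(h))
              for h in h_grid}
    return max(scores, key=scores.get), scores

# Hypothetical partitions of six background points for three bandwidths:
# too small (all singletons), suitable, too large (one cluster).
reference = [0, 0, 0, 1, 1, 1]
toy_partitions = {0.1: [0, 1, 2, 3, 4, 5],
                  0.5: [1, 1, 1, 0, 0, 0],
                  2.0: [0, 0, 0, 0, 0, 0]}
best_h, scores = select_bandwidth(toy_partitions, reference, toy_partitions.get)
```

The index is invariant to relabelling of clusters, which is why the middle partition scores 1 even though its labels are swapped with respect to the reference.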


4 Empirical results 


In this section, we show the results of the application of the proposed procedure on
a Monte Carlo physical process simulated within the CMS experiment; the experiment
refers to high-energy proton-proton collisions where each observation corresponds
to a single collision event and may produce particles from two different
physical processes: the QCD multijet background, and a signal known as top pair
production. $\mathcal{X}_b$ includes $n_b = 20000$ background observations, while
$\mathcal{X}_{bs}$ includes $n_{bs} = 10000$ observations, 16% of which come from the
signal process. For each dataset we observe $d = 30$ variables related to the
kinematic characteristics of the particles produced by the proton collisions. While
both $\mathcal{X}_b$ and $\mathcal{X}_{bs}$ are labelled, the labels of
$\mathcal{X}_{bs}$ have been employed only for evaluating the quality of the results.

Figure 1 (top left) displays the results of the variable selection procedure. Two
features (dp/2 and jcsv1) show a remarkably different behavior between the
background and whole process densities. In the subsequent analyses we have worked
with these two variables only.

After estimating $f_b$ based on a bandwidth selected to guarantee density stability,
as explained in the previous section, we applied the procedure reported in
Pseudo-algorithm 1. The obtained bandwidth was used to estimate the density $f_{bs}$
of the whole process and thus to obtain a partition of $\mathcal{X}_{bs}$ via the
subsequent application of nonparametric clustering. Results, reported in the top right
table of Figure 1, compare the obtained partition with the known actual labels. The
procedure identifies four different groups: two of them clearly refer to the
background process, while the other two mostly contain observations coming from the
top pair production signal. The overall misclassification error is equal to 2.6%,
with a true positive rate larger than 88%. For comparison purposes we also present
results of the application of the competing methodology proposed in [5]. Data
dimensionality has been previously reduced by keeping four principal components, as
proposed by the authors. Working in a parametric framework, there is a one-to-one
relationship between mixture components and clusters; hence the method finds 9
background clusters and an additional one capturing the signal. The overall error is
equal to 4.5%, with a true positive rate of 74.5%.

[Top left panel: bar chart of variable importance; horizontal axis: Variables]

Top right panel (proposed procedure):

         Clusters
Label    1       2       3       4
Bkg      0.410   0.421   0.003   0.004
Sgn      0.008   0.011   0.088   0.055

Bottom panel (benchmark method [5]):

         Clusters
Label    1       2       3       4       5       6       7       8       9       10
Bkg      0.056   0.180   0.066   0.073   0.097   0.099   0.092   0.033   0.139   0.004
Sgn      0.003   0.002   0.004   0.016   0.002   0.002   0.006   0.003   0.003   0.120

Fig. 1: Top left: results of variable selection procedure; variables are ordered decreasingly by
importance (higher bar implies higher importance). Top right: true process labels vs
clusters detected by the proposed semisupervised procedure. Bottom: true process labels
vs clusters detected by the benchmark method [5].
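
The error rates quoted in the text can be re-derived from the two tables of Fig. 1. The sketch below does so under the assumption that clusters 3 and 4 (proposed procedure) and cluster 10 (benchmark) are the ones read as signal, as the text suggests; the function name is illustrative.

```python
# Joint proportions (true label by detected cluster) as printed in Fig. 1.
proposed = {"Bkg": [0.410, 0.421, 0.003, 0.004],
            "Sgn": [0.008, 0.011, 0.088, 0.055]}
benchmark = {"Bkg": [0.056, 0.180, 0.066, 0.073, 0.097,
                     0.099, 0.092, 0.033, 0.139, 0.004],
             "Sgn": [0.003, 0.002, 0.004, 0.016, 0.002,
                     0.002, 0.006, 0.003, 0.003, 0.120]}

def error_and_tpr(table, signal_clusters):
    """Misclassification error and true positive rate, given which
    clusters (1-based) are interpreted as signal."""
    sig = set(signal_clusters)
    bkg_as_signal = sum(p for k, p in enumerate(table["Bkg"], start=1) if k in sig)
    sgn_as_signal = sum(p for k, p in enumerate(table["Sgn"], start=1) if k in sig)
    sgn_total = sum(table["Sgn"])
    # Errors: background placed in signal clusters plus signal placed elsewhere.
    error = bkg_as_signal + (sgn_total - sgn_as_signal)
    tpr = sgn_as_signal / sgn_total
    return error, tpr

err_prop, tpr_prop = error_and_tpr(proposed, signal_clusters=[3, 4])
err_bench, tpr_bench = error_and_tpr(benchmark, signal_clusters=[10])
print(f"proposed:  error={err_prop:.3f}, TPR={tpr_prop:.3f}")
print(f"benchmark: error={err_bench:.3f}, TPR={tpr_bench:.3f}")
```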
Employment study methodologies of Italian
graduates through the data linkage of
administrative archives and sample surveys


Methodologies for studying the employment of Italian
graduates through the data linkage of administrative
archives with those of sample surveys


Claudio Ceccarelli, Silvia Montagna and Francesca Petrarca 


Abstract We discuss the issues and the related study methodologies raised by the 
data linkage among different Istat archives to provide information on the employ- 
ment status of Italian graduates. To this aim many different administrative archives 
are integrated with data from sample surveys on university graduates’ vocational 
integration. From this integration, a very complex situation emerges which must be 
analysed and correctly interpreted. In this paper, in order to show the feasibility of
our method, we discuss the comparison among these sources and we present the
strategy for constructing appropriate indicators.

Abstract We discuss the issues, and the related study methodologies required, brought
to light by the integration of different Istat archives to obtain information on the
employment status of Italian graduates. To this end, several administrative archives
have been integrated with data from the sample surveys on graduates' vocational
integration. From this integration a very complex situation has emerged, which must
be analysed and correctly interpreted. In this paper, to show the applicability of
our method, we discuss the comparison among the different sources and present the
strategy for constructing appropriate indicators.
1 Introduction


In the last few years, Italian universities have shown an increasing interest in
studying the characteristics of the labour market demand for their alumni, in line
with the requests coming from Italian industry [4]. Periodic reports have been produced
by public institutions, for example the Italian National Institute of Statistics (Istat)
[10]-[9], or by private entities, such as [5] and AlmaLaurea [1].

The information on the contracts obtained by each graduate retrieved in the
integrated database includes data about the contract type and job qualification, the
sectors of economic activity, the location of businesses and the educational
curriculum of the alumni in the period from upper secondary school diploma to
university graduation.

A general modernization plan, which involves the Italian Statistical Institute
[11]-[2] and other national statistical offices, aims to produce the best
possible estimates to meet user needs from multiple data sources (surveys,
administrative archives and new sources such as big data) and, moreover, to reduce
burden and costs. The integration of administrative archives to produce useful
statistics is an issue also addressed and discussed in the rest of the
world (see [6] and [8]).

In this paper, data are used to develop an explorative study on the transition of
graduates into the Italian labour market, taking advantage of data coming from
administrative archives, which allow sophisticated and robust statistical analyses
[7]-[3]-[12]-[14]. Our database also contains data coming from the sample survey
on university graduates’ vocational integration.


2 Sources of data 


For the first time, administrative data on Italian graduates of the year 2011 and
on their employment status in the following four years are available. These data
can be used to benchmark the answers given by the interviewees against the evidence
from administrative sources. The study is strongly experimental in character, for
various reasons mainly related to critical issues in the available sources.

The main data sources are: 


1. Sample survey on university graduates' vocational integration, carried out on
graduates of 2011: the interview, conducted in 2015, four years after graduation,
records the training paths and employment outcomes at different moments or time
intervals:


before graduation;

at the moment of graduation;

one year after graduation;

at the moment of the interview.

The final survey data are stored in an archive called Armida, which contains data
on 58,400 individuals. For more details on the survey see [9] and [10].

In Tab. 1 we report, from the survey, the percentage of Italian graduates who were
working at the moment of graduation, after one year and after four years, for each
level of Italian university degree.


Table 1 Percentage of Italian graduates who work at the moment of graduation, after one year
and after four years, for different levels of Italian university degree


                                | At the moment of grad. | 1 year from grad. | 4 years from grad.
First cycle degree (bachelor)   | 28.7                   | 37.4              | 72.8
Second cycle degree (master)    | 34.7                   | 55.7              | 84.5
Single-cycle master degree      | 27.0                   | 40.3              | 80.3


Source: sample survey


2. The National Register of Students (ANS), a Ministry of Education, University
and Research (MIUR) source, provides, for each individual, the personal data and
the university career from enrolment to graduation. In this paper, we consider
all the graduates who obtained their university degree in 2011. Tab. 2 reports
the number of graduates in 2011 for each level of Italian university degree.


Table 2 Number of graduates in 2011 for different levels of Italian university degree


                                | Graduates
All                             | 299,449
First cycle degree (bachelor)   | 169,232
Second cycle degree (master)    | 86,593
Single-cycle master degree      | 43,624


Source: MIUR administrative archive


3. The Integrated Base of Administrative Sources of Istat contains the employment
status of the Italian population. This administrative archive records all the
business relationships of the Italian population. To identify a business
relationship, the database must contain an administrative record confirming a
relationship with an employer (evidence of LEED type: Linked Employer Employee
Dataset or Database¹). The administrative record must relate to a contributory
position (INPS source), a social security position (INPDAP source) or another
event associated with the worker and recorded in one of the other employment
archives available at the time of the analysis [13].

Therefore, in this case the employment status of the graduates is given by
their contributory or social security position. This information is recorded
for each month of the year. The main issue with administrative data is that we
must establish a proper definition of an employed graduate. In this study, we
define as a worker a graduate for whom at least one contributory position in a
month of the year considered has been recorded in the Integrated Base of
Administrative Sources archive.


¹ LEED is the result of research linking every employer to its employees and, vice versa, every
worker to his or her employer.
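The definition of a worker given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (one tuple per graduate and month with a recorded position), the function names and the toy data are hypothetical and are not taken from the Istat archive.

```python
from collections import defaultdict

def years_worked(monthly_records):
    """Map each graduate id to the set of years with at least one recorded position.

    monthly_records: iterable of (graduate_id, year, month) tuples, one per month
    in which a contributory position was recorded (hypothetical layout).
    """
    worked = defaultdict(set)
    for graduate_id, year, _month in monthly_records:
        worked[graduate_id].add(year)
    return worked

def employment_rate(worked, graduates, year):
    """Percentage of graduates classified as workers in the given year."""
    employed = sum(1 for g in graduates if year in worked.get(g, set()))
    return 100.0 * employed / len(graduates)

def always_employed_rate(worked, graduates, years):
    """Percentage of graduates classified as workers in every listed year."""
    wanted = set(years)
    employed = sum(1 for g in graduates if wanted <= worked.get(g, set()))
    return 100.0 * employed / len(graduates)

# Toy data, invented for illustration only.
graduates = ["g1", "g2", "g3", "g4"]
records = [("g1", 2011, 3), ("g1", 2012, 1), ("g2", 2012, 7), ("g2", 2012, 8)]
worked = years_worked(records)
rate_2011 = employment_rate(worked, graduates, 2011)
rate_2012 = employment_rate(worked, graduates, 2012)
rate_both = always_employed_rate(worked, graduates, [2011, 2012])
```

A single month with a recorded position is enough for a graduate to count as a worker in that year, so this indicator does not distinguish short contracts from continuous employment.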


[Figure 1: line chart with one series per degree level (first cycle degree, second cycle
degree, single-cycle master degree); x-axis: year of graduation, 1 year after graduation,
year 2013, year 2014]


Fig. 1 Employment trends in the four years after graduation for different levels
of Italian university degree, from the administrative archive.


Tab. 3 reports the employment of graduates in the year of graduation and in the
following years. In 2014, first cycle graduates were less likely to be employed
than graduates of the other degree types. The variation in employment over the
four years can be seen in Fig. 1, which shows a similar behaviour for the second
cycle and single-cycle master degree curves: both increase up to 2013 and then
level off. For first cycle degrees, the curve grows almost linearly. The last
column of Tab. 3 reports the percentage of graduates who were employed in each of
the four years. Here too, first cycle graduates are the least favoured.


Table 3 Percentage of Italian graduates who work in the four years after graduation for different
levels of Italian university degree


                                | In the year of graduation | 1 year after graduation | 2013  | 2014  | ALL
First cycle degree (bachelor)   | 33.41                     | 44.07                   | 47.66 | 55.54 | 22.22
Second cycle degree (master)    | 42.33                     | 63.58                   | 69.58 | 69.56 | 31.77
Single-cycle master degree      | 36.13                     | 59.25                   | 66.17 | 62.16 | 28.11


Source: Integrated Base of Administrative Sources of Istat
3 Comparison


The statistical sample of the graduates of 2011 (58,400 units) is merged with
the Integrated Base of Administrative Sources in order to compare the percentages
of working graduates. We call the result the combined data. The first column of
Tab. 4 shows the percentage of graduates recognized as working one year after
graduation according to the administrative information contained in the combined
data records. For comparison, the second column of Tab. 4 reports the percentages
obtained from the sample survey information contained in the same records.


Table 4 Comparison between the percentages of Italian graduates who work one year after
graduation (combined data).


                                | Administrative information | Sample survey information
First cycle degree (bachelor)   | 37.86                      | 37.40
Second cycle degree (master)    | 55.06                      | 56.70
Single-cycle master degree      | 40.89                      | 40.30


The differences between the results are small: the percentages from the
administrative information are slightly higher for first cycle and single-cycle
master degrees, and slightly lower for second cycle degrees. Higher administrative
figures are to be expected, because the use of administrative data reduces the
possibility of losing units, whereas the results of the sample survey are
subjective and depend heavily on the memory of the interviewee. Moreover, during
the interview people sometimes do not declare jobs that are not coherent with
their educational path or are of short duration. It is worth noting that, with
administrative data, the indicator of working graduates is able to achieve a
better classification of the working units.
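The comparison on the combined data can be illustrated with a small Python sketch. The data layout, the function name `compare_sources` and the toy records are assumptions made for the example; they do not reproduce the actual record linkage between the survey and the administrative archive.

```python
def compare_sources(survey, admin):
    """Return {degree: (admin_pct, survey_pct)} for graduates in the survey sample.

    survey: {graduate_id: (degree, works_according_to_survey)}
    admin:  set of graduate ids with an administrative record in the reference year
    Both layouts are hypothetical.
    """
    totals, admin_hits, survey_hits = {}, {}, {}
    for graduate_id, (degree, works_survey) in survey.items():
        totals[degree] = totals.get(degree, 0) + 1
        admin_hits[degree] = admin_hits.get(degree, 0) + (graduate_id in admin)
        survey_hits[degree] = survey_hits.get(degree, 0) + bool(works_survey)
    return {
        d: (100.0 * admin_hits[d] / totals[d], 100.0 * survey_hits[d] / totals[d])
        for d in totals
    }

# Toy data, invented for illustration only.
survey = {
    "g1": ("bachelor", True),
    "g2": ("bachelor", False),
    "g3": ("master", True),
    "g4": ("master", True),
}
admin = {"g1", "g2", "g3"}
comparison = compare_sources(survey, admin)
```

Both percentages are computed on the same units (the survey sample), so a gap between the two columns reflects a difference between the sources rather than a difference in coverage.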


We underline the importance of using administrative data to study the entrance
of graduates into the Italian labour market. We have briefly shown that an
administrative archive is flexible and rich enough to analyse the work paths of
graduates in the years following graduation. Moreover, administrative data allow
us to follow graduates after graduation and therefore to analyse changes in their
job position. We plan to perform these studies through longitudinal analyses. The
analyses on the integrated database could be adopted as a permanent monitor of
the entrance of graduates into the Italian labour market over the years, possibly
accompanied by the sample survey.
Abstract. A series of challenges have recently emerged in the data mining field,
triggered by its rapid shift in status from academic to applied science and by the
resulting needs of real-life applications. Recourse to statistical learning models
such as support vector machines (SVM) or neural networks is common practice.
However, the performance of these algorithms depends strongly on the quality of
the data used. This constraint obliges the data scientist to apply different
statistical methods before using these algorithms. This paper applies a feature
selection method to the financial data of 20 firms in order to set up a Support
Vector Machine (SVM) model with which we can predict firms' creditworthiness risk.


Key words: Feature Selection, Dimensionality Reduction, Factor Analysis Model, 
SVM, Intelligent Financial Solution, Financial Health. 


1 Introduction 


Nowadays, owing to recent developments in science and technology, several data
anomalies have appeared. The most common anomalies found in real-world data are
incomplete records, irrelevant and/or redundant pieces of information, imbalanced
class distributions and imbalanced error costs, but the most recurrent problem
remains the large size of the data.

Indeed, many scientific studies involve a huge number of variables, which can make
the study rather complicated. In these cases, dimensionality reduction techniques
are widely used. In this paper, we perform a dimensionality reduction using the
correlation matrix and factor analysis. The reduced dataset is used as the input of
a Support Vector Machine algorithm in order to generate a prediction model.
Experimental results on a set of financial data for 20 Moroccan firms are presented
and discussed.

This paper is structured as follows. Section 2 presents an overview of feature
selection and dimensionality reduction, and Section 3 presents the SVM approach.
Section 4 presents the data set and the experimental results. Finally, the last
section provides some concluding comments.


2 Feature Selection And Dimensionality Reduction of Data 


Feature selection techniques have become a clear need in many different fields
because of the constant growth of data. The objective of feature selection is
threefold: improving the prediction performance of the predictors, providing faster
and more cost-effective predictors, and providing a better understanding of the
underlying process that generated the data [1].

Among the most widely used methods of feature selection is factor analysis, which
aims to reduce “the dimensionality of the original space and to give an
interpretation to the new space, spanned by a reduced number of new dimensions
which are supposed to underlie the old ones” [2].

The Factor Analysis Model is one of several techniques that seek to explain the
correlation between a set of variables by a smaller set of random variables. It
uses a small number of unobserved (latent) variables to express the basic data
structure by studying the internal relationships between the variables, and
reflects the main information and interdependence of the original data.

The following assumes that the p observed variables (the $X_i$) that have been
measured for each of the n subjects have been standardized.

\[
\begin{aligned}
X_1 &= a_{11}F_1 + \cdots + a_{1m}F_m + e_1 \\
X_2 &= a_{21}F_1 + \cdots + a_{2m}F_m + e_2 \\
    &\;\;\vdots \\
X_p &= a_{p1}F_1 + \cdots + a_{pm}F_m + e_p
\end{aligned}
\]

The $F_j$ are the m common factors, the $e_i$ are the p specific errors, and the
$a_{ij}$ are the p × m factor loadings.
In matrix form this can be written as:

\[ X_{p \times 1} = A_{p \times m} F_{m \times 1} + e_{p \times 1} \]


It does not make sense to use factor analysis if the variables are unrelated;
this is why the starting point of factor analysis is a correlation matrix.

The correlation matrix provides the inter-correlations between the variables
studied. The dimensionality of this matrix can be reduced by looking for variables
that correlate highly with a group of other variables but correlate very weakly
with variables outside that group. Variables with high inter-correlations may well
measure one underlying variable, which is called a factor.
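As an illustration of this starting point, the sketch below computes a Pearson correlation matrix in plain Python and lists the pairs of variables whose absolute correlation reaches a threshold. The function names, the 0.8 threshold and the toy columns are assumptions for the example; the paper itself computes correlations with SPSS.

```python
from math import sqrt

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]

def correlated_pairs(corr, threshold=0.8):
    """Index pairs (i, j), i < j, whose absolute correlation reaches the threshold."""
    p = len(corr)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(corr[i][j]) >= threshold]

# Toy data, invented for illustration: columns 0 and 1 move together, column 2 does not.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],
    [3.0, -1.0, 2.0, -2.0, 1.0],
]
corr = correlation_matrix(columns)
pairs = correlated_pairs(corr, threshold=0.8)
```

Variables that end up in the same highly correlated group are candidates for being summarized by a single factor; the factor extraction itself is not shown here.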
3 Support Vector Machine Model


Support Vector Machine (SVM) is a powerful method for pattern recognition and
classification introduced by Vapnik [3]. The SVM maps the input data into a higher
dimensional feature space via a nonlinear map and constructs a separating
hyperplane with maximum margin. It has also been proposed as a technique for time
series prediction. The key characteristic of SVM is that a nonlinear function is
learned by a linear learning machine in a kernel-induced feature space, while the
capacity of the system is controlled by a parameter that does not depend on the
dimensionality of the space. The SVM algorithm is as follows [4]:


Consider a given training set $\{(x_i, y_i)\}_{i=1}^{l}$ randomly and independently
generated from an unknown function, where $x_i \in X \subset \mathbb{R}^n$,
$y_i \in Y \subset \mathbb{R}$ and $l$ is the total number of training data.

The SVM approximates the unknown function using the following form:

\[ f(x) = \langle w, \Phi(x) \rangle + b \qquad (1) \]

where $\langle \cdot, \cdot \rangle$ is the dot product, $w$ and $b$ are the
estimated parameters and $\Phi$ is a nonlinear function used to map the original
input space $\mathbb{R}^n$ to a high dimensional feature space. So, the nonlinear
function estimation in the original space becomes linear in the feature space.

The optimization goal of standard SVM is formulated as:

\[ \text{Minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) \qquad (2) \]

Subject to:

\[
\begin{aligned}
y_i - \langle w, \Phi(x_i) \rangle - b &\le \varepsilon + \xi_i, \\
\langle w, \Phi(x_i) \rangle + b - y_i &\le \varepsilon + \xi_i^*, \\
\xi_i, \ \xi_i^* &\ge 0
\end{aligned}
\qquad (3)
\]

where the constant $C > 0$ determines the trade-off between the flatness of $f$ and
the amount up to which deviations larger than $\varepsilon$ are tolerated, and
$\xi_i$, $\xi_i^*$ are slack variables introduced to accommodate, respectively, the
positive and the negative errors on the training data. The formulation above
corresponds to dealing with the so-called $\varepsilon$-insensitive cost function:

\[
|e|_{\varepsilon} =
\begin{cases}
0 & \text{if } |e| \le \varepsilon \\
|e| - \varepsilon & \text{otherwise}
\end{cases}
\qquad (4)
\]

The nonlinear function is obtained as:

\[ f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) K(x, x_i) + b \qquad (5) \]

where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers of the dual problem
and $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ is defined as the kernel
function. The elegance of using the kernel function is that one can deal with
feature spaces of arbitrary dimensionality without having to compute the map
$\Phi$. Any function that satisfies Mercer’s condition can be used as the kernel
function.
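To make the pieces of this formulation concrete, the sketch below evaluates the ε-insensitive cost, a Gaussian (RBF) kernel and the kernel expansion of f(x) in plain Python. It is an illustration only: the parametrization exp(-γ‖x − z‖²) of the kernel is an assumption (the text does not state which form is used), and the support vectors, coefficients and bias are invented values rather than the result of solving the optimization problem.

```python
from math import exp

def eps_insensitive(error, eps):
    """epsilon-insensitive cost: zero inside the tube, linear outside it."""
    return max(0.0, abs(error) - eps)

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2) (assumed parametrization)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def predict(x, support_vectors, coefficients, b, gamma):
    """Evaluate f(x) = sum_i coefficient_i * K(x, x_i) + b,
    where coefficient_i stands for (alpha_i - alpha_i^*)."""
    return sum(c * rbf_kernel(x, sv, gamma)
               for c, sv in zip(coefficients, support_vectors)) + b

# Invented values, for illustration only (not a trained model).
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
coefficients = [0.5, -0.25]
value = predict((0.0, 0.0), support_vectors, coefficients, b=0.1, gamma=0.1)
inside_tube = eps_insensitive(0.05, eps=0.1)
outside_tube = eps_insensitive(-0.3, eps=0.1)
```

The prediction only needs kernel evaluations against the stored training points, which is why the map Φ never has to be computed explicitly.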


4 Proposed method and experimental results 


In this paper, we propose to reduce the dimension of a real-world data set using
factor analysis before feeding the reduced data to an SVM in order to generate a
prediction model.

For our experiments, we chose financial data over the three years 2009-2011, all
from 20 Moroccan companies that belong to different sectors and have different
sizes. First, we selected 39 variables, over the 3 years, that affect the financial
health of companies:


Table 1: Selected variables


[Table not recoverable from the extracted text. Legible entries include: Turnover;
Equity / Total assets; Net cash; Gross operating profit (V29); Free cash flow;
Cash flow / Turnover (V24).]


When we assess a firm’s creditworthiness risk, we want to know whether the company
is solvent, whether it is profitable and whether it is still productive. This is
why we computed the correlation coefficients, using SPSS 10, between our 39
variables and the three relevant components (solvency, productivity and
profitability), in order to detect the variables that influence them the most.

We next reduced the dimensionality of the input factors using the Principal Factor
Analysis (PFA) approach in SPSS 10, in order to obtain factors that create a new
dimension and that can be visualized as classification axes along which
measurement variables can be plotted [5].

Table 2 shows the 8 factors that are retained instead of the 39 variables. The
eight factors, based on the data from 2009-2010, explain 86% of the variance, which
indicates that the model captures the data well. We can therefore reduce the input
dimension to eight factors to predict the creditworthiness risk.


Table 2: Input Factors. 
As mentioned before, we use SVM to build the prediction model on the reduced data.
To obtain good performance, we carefully chose the parameters, which include the
regularization parameter C, which determines the trade-off between minimizing the
training error and minimizing model complexity, and the parameters of the kernel
function.

In this simulation we test the classification using the RBF kernel function, so two
parameters need to be chosen: the width γ of the RBF function and the soft margin
parameter C of the SVM [6].

One method often used to select the parameters is a grid search on a logarithmic
scale of the parameters, combined with cross-validation. Each pair (C, γ) was
assessed using cross-validation, and we then chose the pair with the highest
precision: (C, γ) = (100, 0.1).
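The selection procedure can be sketched as follows. This is a minimal illustration, not the setup used in the experiments: the fold construction, the powers-of-ten grids and the names `k_fold_indices`, `grid_search` and `toy_score` are assumptions, and the scorer is a stand-in with an arbitrarily placed peak so that the sketch runs without an SVM library. In practice `score` would train an RBF-kernel SVM on the training fold and return its precision on the held-out fold.

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train, test) index lists."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def grid_search(score, n_samples, c_grid, gamma_grid, k=5):
    """Return (best_mean_score, C, gamma) over the grid, by k-fold cross-validation.

    score(C, gamma, train, test) is a user-supplied callable (hypothetical here).
    """
    best = None
    for c, gamma in product(c_grid, gamma_grid):
        folds = list(k_fold_indices(n_samples, k))
        mean = sum(score(c, gamma, tr, te) for tr, te in folds) / len(folds)
        if best is None or mean > best[0]:
            best = (mean, c, gamma)
    return best

# Logarithmic grids (powers of ten), chosen for illustration only.
c_grid = [10.0 ** e for e in range(-1, 4)]       # 0.1 ... 1000
gamma_grid = [10.0 ** e for e in range(-3, 2)]   # 0.001 ... 10

def toy_score(c, gamma, train, test):
    """Stand-in scorer, invented so the sketch runs; its peak is arbitrary."""
    return 1.0 / (1.0 + abs(c - 10.0) + abs(gamma - 0.01))

best = grid_search(toy_score, n_samples=40, c_grid=c_grid, gamma_grid=gamma_grid, k=5)
```

Searching on a logarithmic scale covers several orders of magnitude with few candidates, which matters because suitable values of C and γ are rarely known in advance.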

According to the architecture of the support vector machine, only the training data
near the boundaries are necessary. In addition, because the training time grows
with the number of training data, it is shortened if the data far from the boundary
are deleted. We therefore trained the model on a sample of 40 Moroccan companies
whose financial data were extracted over 2009-2011 and reduced to our 8 factors. We
then applied the trained SVM model to a new sample of 20 Moroccan companies whose
financial data were selected over 2009-2011, in order to measure the precision of
the creditworthiness risk prediction against the actual data of 2011.

In order to test the effectiveness of the proposed method, a series of simulations 
were carried out to predict solvency, productivity and profitability, as follows: 
Figure 1: Solvency risk prediction

Figure 2: Productivity risk prediction

Figure 3: Profitability risk prediction


The results above show a precision of about 90% for the creditworthiness risk
prediction, which indicates that the model predicts well on this sample.


5 Conclusion 


In this paper we presented a simple and efficient way to improve the performance
of the SVM predictor by reducing the dimensionality of the training set, using
factor analysis as the feature selection technique. Experiments were carried out
on a sample of financial data of 20 companies over 2009-2011.

The simulation results show that our SVM model gives good precision, and that we
are able to forecast companies’ default and to offer intelligent financial
solutions to investors and financial institutions to help them in decision-making.

We consider this study the beginning of a line of research in which we will explore
more parameters that can improve prediction performance.
Contribution of extracting meaningful patterns
from semantic trajectories


Sana Chakri, Said Raghay, Salah El Hadaj


Abstract Explosive growth in geospatial and temporal data emphasizes the need for
automated discovery of spatiotemporal knowledge. Different algorithms have been
proposed in the last few years for discovering different types of behaviour in
trajectory data. Integrating trajectory sample points with geographical and
contextual data before applying mining techniques can be more beneficial for
application users: it helps to produce significant knowledge about movements and
provides applications with richer and more meaningful patterns. Trajectory outliers
are one kind of pattern that can be extracted from trajectories. We propose a new
approach to trajectory outlier detection based on semantic data in addition to
geometric data.


Key words: moving objects analysis, spatial databases, data-mining, semantic 
clustering, semantic trajectories 


1 Introduction 


Researchers from the spatial database, GIS, data mining and knowledge extraction
communities have developed several techniques for mobility analysis. As a
consequence, three research areas have expanded. The first focuses on data
modelling, to provide definitions and extensions of trajectory-related data types
such as moving objects, points, lines or regions. The second deals with data
management, to optimize the storage of mobility data with suitable indexing and
querying techniques. The last, which is the main topic of this research, deals with
the analysis of patterns that can be extracted from stored data such as
trajectories by using spatiotemporal data mining algorithms. Several data mining
methods have been proposed for extracting patterns from trajectories. However, the
majority of them use trajectories without looking for any additional information;
yet when only the raw trajectory data are considered, discovering why an object
followed a different route becomes very complex, since no additional information
(called semantics) is given about the moving object. This additional information
can carry a lot of meaning and can lead to a better understanding of the patterns
extracted. This can be achieved by combining the raw mobility tracks (e.g., the GPS
records) with related contextual data, in order to use semantic trajectories
instead of focusing only on the geometric side of trajectories [1, 2]. Semantics
refers essentially to the additional contextual and geographical information
available about the moving object, apart from its position. Semantics contains the
geometric properties of the moving object as well as the geographic properties and
any other additional information, such as the moving object’s activity, mode of
transportation, speed or any data that can help give more meaning to the behaviour
extracted. The purpose of this research is to find spatial, spatiotemporal and
temporal outliers among semantic trajectories.

In this paper we try to go further in semantic trajectory outlier detection, in
order to deduce the possible reason why an object moves differently from its group.
More specifically, we enrich trajectories with semantic data and then extract
outliers based on both geometrical and semantic information, to give more meaning
to the behaviour extracted. The rest of the paper is organized as follows. Section
2 presents the methodology pursued to detect semantic outliers. Section 3 presents
the methodology to add meaning to the patterns extracted. Section 4 gives
experimental results on real trajectory data. Finally, Section 5 concludes the
paper with discussion and comparisons.


2 Enriching trajectories with semantics 


The purpose is to find spatiotemporal and temporal outliers between regions of
interest [3] and to analyse them with semantic data in order to understand the
meaning of the outliers detected. Spatiotemporal outliers are sub-trajectories that
differ spatially and temporally from common trajectories, while temporal outliers
are moving objects that behave spatially like the majority of the other moving
objects but differ temporally: for instance, moving objects that took the same
route but accelerate or make a significant number of stops, which makes them appear
suspicious. The analyses presented in this paper are made on sub-trajectories that
link regions of interest. These are shapes whose size and format depend on the
application: they can be regions (ROI), lines (LOI) or even points (POI), and they
can be districts, dense areas, hotspots, important places, etc. In general, a
region of interest can be a predefined important place or can be computed by an
algorithm that finds dense areas. In our case we consider a region to be a point,
line or polygon, which is a well-known concept in the GIS community. The use of
regions allows us to filter from the whole dataset only the sub-trajectories that
move between the same regions. Outliers are searched for among these sets, which
significantly reduces the search space. Among the trajectories that cross all
regions, we are only interested in the parts of trajectories (sub-trajectories)
that move between specific regions; we call these sub-trajectories nominees. After
defining the set of nominees, we look for temporal outliers and for spatial
outliers, from which we extract the spatiotemporal outliers. A nominee is a spatial
outlier when it follows a different path from the majority of the sub-trajectories
of its group, and it can be a temporal outlier if it follows the same path but
shows different behaviour compared to the other moving objects. In general, we have
two types of path: a populated path, which has many trajectories in its proximity,
and a depopulated path, which has fewer trajectories around it. The spatial and
spatiotemporal outliers are extracted from depopulated paths, while the temporal
outliers are extracted from the populated paths.
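The filtering of nominees can be sketched as follows. This is a simplified illustration, not the algorithm of the paper: it assumes that every sample point already carries the label of the region of interest that contains it (or None), and it discards a candidate that enters a third region before reaching the destination. Both choices, and the name `extract_nominees`, are hypothetical.

```python
def extract_nominees(points, start_region, end_region):
    """Return the sub-trajectories that leave start_region and next reach end_region.

    points: list of (x, y, region) samples in time order; region is the label of
    the region of interest containing the sample, or None (hypothetical layout).
    """
    nominees, current = [], None
    for point in points:
        region = point[2]
        if region == start_region:
            current = [point]        # restart at the last sample inside the start region
        elif current is not None:
            current.append(point)
            if region == end_region:
                nominees.append(current)
                current = None
            elif region is not None:
                current = None       # entered a different region: not a nominee
    return nominees

# Toy trajectory, invented for illustration only.
trajectory = [
    (0, 0, "A"), (1, 0, "A"), (2, 0, None), (3, 0, None),
    (4, 0, "B"), (5, 0, None), (6, 0, "C"),
]
nominees = extract_nominees(trajectory, "A", "B")
```

Only the samples between the two regions are kept, so the later search for outliers compares sub-trajectories that share the same origin and destination.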


3 Giving Meaning to patterns extracted 


After extracting outliers from semantic trajectories, the next step is to add
meaning to them by splitting the outliers extracted into several types according to
their semantic interpretation:
A. Spatiotemporal outliers
1) Stop outliers
A stop outlier occurs when the moving object stops for some time during the
deviation. We consider as a stop a sub-trajectory whose speed is close to zero for
a minimal amount of time (MT).
2) Emergency outliers
An emergency outlier occurs when the moving object takes an alternative route and
shows a significant increase in speed; the reason is most likely an emergency, such
as an ambulance transporting a patient or someone trying to escape from the police.
We consider that there is an emergency outlier if the speed of the fast outlier is
more than double the average speed of the synchronized outliers detected on the
same alternative route.
3) Regular outliers
A regular outlier occurs when the moving object deviates from the populated route
without a significant change of speed, or with a decrease in speed. This may reveal
that the populated route is temporarily busy or under reconstruction, or that an
accident or an event blocks the path, so that the moving object is forced to
deviate from the populated route. This can cause heavy traffic on the alternative
ways and, as a consequence, the speed of the moving object may decrease. Our
algorithm groups all these reasons into three types of outliers: the blocked route
outlier, the avoided route outlier and the traffic jam outlier:

a) Blocked route outliers: any deviation caused by something happening close to
the populated route that causes a blockage, for instance an accident or route
reconstruction. To discover the reason, we start by analysing only the part of the
closest populated route bypassed by the outlier, and then check whether there is an
activity around the main segments. If so, we verify the time of this activity to be
sure that the outlier was generated at the moment of the activity. Finally, we
verify that, at the time of the activity, there are no synchronized segments on the
populated route, to confirm that the path was blocked by the event.

b) Avoided route outliers: this type of outlier is similar to the first type; the
only difference is that the activity on the populated route does not cause any
blockage. An example could be a police checkpoint.

c) Traffic jam outliers: deviations due to heavy traffic at rush hour. This case
applies when we find an outlier but no activity is blocking the populated route, so
we check whether there is a traffic jam by looking for slow traffic on the
populated route at the time of the outlier.

B. Temporal outliers

Temporal outliers are common trajectories that follow the populated route, but
with a significant difference in speed compared to the other common trajectories.
We extract two essential types: temporal emergency outliers and temporal stop
outliers.

1) Temporal emergency outliers

A temporal emergency outlier occurs when the moving object stays on the populated
route but shows a significant increase in speed; the reason is most likely an
emergency. We consider that there is a temporal emergency outlier if the speed of
the fast common trajectory is more than double the average speed of the
synchronized common trajectories detected on the same populated route.

2) Temporal stop outliers

Temporal stop outliers are common trajectories that traverse the populated route
at a very slow speed compared to the synchronized common trajectories on the same
route. This occurs when the moving object stops for some time on the populated
route. We consider as a temporal stop outlier a sub-trajectory whose speed is close
to zero for a minimal amount of time (MT).
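The rules of this section can be summarized in a short Python sketch. It is a simplified reading of the text under stated assumptions, not the authors' implementation: the threshold for "close to zero" (`speed_eps`), the fixed sampling interval, the order in which the rules are tested and all function names are hypothetical. The flags passed to `classify_regular` stand for the checks described in cases a) to c).

```python
def has_stop(speeds, sample_seconds, min_time, speed_eps=0.5):
    """True if speed stays below speed_eps for at least min_time seconds in a row.

    speed_eps ("close to zero") and the fixed sampling interval are assumptions.
    """
    run = 0.0
    for s in speeds:
        run = run + sample_seconds if s < speed_eps else 0.0
        if run >= min_time:
            return True
    return False

def is_emergency(speed, synchronized_speeds):
    """True if speed exceeds twice the average speed of the synchronized trajectories."""
    if not synchronized_speeds:
        return False
    return speed > 2.0 * sum(synchronized_speeds) / len(synchronized_speeds)

def classify_regular(activity_on_route, route_blocked, slow_traffic):
    """Reason for a regular outlier, following cases a) to c)."""
    if activity_on_route:
        return "blocked route" if route_blocked else "avoided route"
    return "traffic jam" if slow_traffic else "other"

def classify_spatiotemporal(speeds, sample_seconds, min_time, synchronized_speeds,
                            activity_on_route, route_blocked, slow_traffic):
    """Label one spatiotemporal outlier as stop, emergency or a regular subtype."""
    if has_stop(speeds, sample_seconds, min_time):
        return "stop"
    mean_speed = sum(speeds) / len(speeds)
    if is_emergency(mean_speed, synchronized_speeds):
        return "emergency"
    return classify_regular(activity_on_route, route_blocked, slow_traffic)

# Invented speed samples, for illustration only.
label_fast = classify_spatiotemporal([30.0, 32.0, 31.0], 30, 60, [12.0, 14.0, 13.0],
                                     False, False, False)
label_blocked = classify_spatiotemporal([10.0, 11.0], 30, 60, [12.0, 14.0],
                                        True, True, False)
```

The same `has_stop` and `is_emergency` tests apply to temporal outliers, with the synchronized common trajectories of the populated route as the reference group.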


4 Experimental results and comparison 


For the experimental results we analyse real datasets to assess the efficiency of
our method; these datasets are taken from [4, 5, 6, 7]. The Trucks dataset consists
of 276 trajectories of 50 trucks delivering concrete to several construction sites
around the Athens metropolitan area in Greece over 33 distinct days. Note that we
analysed only trajectories from Monday to Friday. The structure of each record is
as follows: {obj-id, traj-id, date(dd/mm/yyyy), time(hh:mm:ss), lat, lon, x, y},
where (lat, lon) is in the WGS84 reference system and (x, y) is in the GGRS87
reference system. These datasets are interesting for analysing outliers because
this type of driver generally knows different routes to reach the same place.
Therefore, we can find the alternative routes (outliers) in relation to the
standard path.
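A record with the structure given above could be read and filtered to working days as sketched below. The field order follows the text; the ';' separator, the `Sample` type, the function names and the two sample lines are assumptions invented for the example and are not taken from the dataset files.

```python
from datetime import datetime
from typing import NamedTuple

class Sample(NamedTuple):
    obj_id: str
    traj_id: str
    timestamp: datetime
    lat: float
    lon: float
    x: float
    y: float

def parse_record(line, sep=";"):
    """Parse one record; the ';' separator is an assumption about the file layout."""
    obj_id, traj_id, date, time, lat, lon, x, y = line.strip().split(sep)
    timestamp = datetime.strptime(f"{date} {time}", "%d/%m/%Y %H:%M:%S")
    return Sample(obj_id, traj_id, timestamp, float(lat), float(lon), float(x), float(y))

def is_weekday(sample):
    """True for Monday to Friday (datetime.weekday() returns 0 for Monday)."""
    return sample.timestamp.weekday() < 5

# Invented records, for illustration only: a Tuesday and a Saturday.
lines = [
    "0862;1;10/09/2002;09:15:59;38.018470;23.845089;486253.80;4207588.10",
    "0862;1;14/09/2002;09:16:29;38.018069;23.845179;486261.60;4207543.60",
]
samples = [parse_record(line) for line in lines]
weekday_samples = [s for s in samples if is_weekday(s)]
```

Keeping the timestamp as a single `datetime` value makes it straightforward to compute speeds between consecutive samples later on.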


Figure 1: Trucks trajectories    Figure 2: Common trajectories    Figure 3: Outliers extracted


Table 1: Trucks outliers extracted


Nominees | Expected outliers | Spatiotemporal outliers | Common trajectories | Temporal outliers
35750    | 9402              | 1157                    | 26348               | 421


Table 2: Semantic spatiotemporal outliers from trucks trajectories


Spatiotemporal outliers

Stop | Emergency | Regular
512  | 14        | 631

Regular outliers by type

Blocked route | Avoided route | Traffic Jam | Others
14            | 345           | 225         | 47


Table 3: Semantic temporal outliers from trucks trajectories


Temporal outliers

Stop | Emergency
387  | 43


In this experiment we consider as regions of interest the districts around the
Athens metropolitan area. The application domain data comprise information about
the drivers, the number of students for the school buses, the type and number of
products for the trucks, the names of the districts, and the activities of the
drivers and regions in this period. The closest algorithms to our approach are the
TRAOD algorithm [8] and the Tra-SOD algorithm [9]. So far we have no experimental
comparison between the algorithms, because the three algorithms provide different
outputs, but we have classified some characteristics of each of them in the table
below to clarify the important differences between them. The TRAOD algorithm does
not consider regions or the main route, and it does not perform any further
analysis of the outliers. The Tra-SOD algorithm considers regions, searches for the
main route and gives some further analysis of the outliers extracted. The SOA
algorithm presents two levels of classification of the outliers extracted: the
first uses speed to classify the emergency outliers, and the second uses semantics
to reveal the reasons for the regular outliers. The table below summarizes the
characteristics of the three algorithms.


Table 4: Comparison between algorithms


                               TRAOD          Tra-SOD        SOA
Regions of interest            No             Yes            Yes
Populated/depopulated route    No             Yes            Yes
Level of classification:
  Spatial/temporal             No             No             Yes
  Speed                        No             No             Yes
  Semantics                    No             Yes            Yes
Types of outliers extracted    Geometrical    Stops,         Spatiotemporal: stops,
                               outliers       avoidance,     emergency, regular.
                                              traffic jam    Temporal: stops,
                                                             emergency
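
The two-level classification attributed to SOA above (speed first, then semantics) can be sketched as follows. This is only an illustrative sketch and not the authors' implementation: the speed threshold, the function name and the way semantic annotations are passed in are all assumptions; only the category labels come from Table 2.

```python
# Reasons for regular outliers, as listed in Table 2.
REGULAR_REASONS = ("blocked route", "avoided route", "traffic jam")


def classify_outlier(avg_speed_kmh, semantic_tags, emergency_speed_kmh=90.0):
    """Classify one detected outlier.

    Level 1 uses speed: an outlier travelled at or above the (hypothetical)
    threshold is labelled an emergency.
    Level 2 uses semantics: the remaining (regular) outliers are explained
    by the first matching reason found in their semantic annotations.
    """
    if avg_speed_kmh >= emergency_speed_kmh:
        return "emergency"
    for reason in REGULAR_REASONS:
        if reason in semantic_tags:
            return "regular: " + reason
    return "regular: others"
```

For example, classify_outlier(45.0, {"traffic jam"}) yields "regular: traffic jam", while an outlier with no matching annotation falls into "regular: others".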


5 Conclusion 


In this paper we have shown that trajectory knowledge discovery depends directly on
the application domain: geographic information must be integrated into trajectories
to create semantic trajectories before meaningful patterns can be extracted. Several
algorithms have been proposed for trajectory data mining, but only a few consider
semantics, and very few of them deal with semantics in trajectory outlier detection.
In this paper we focused on outliers extracted from semantic trajectories. The
algorithm shown in this experiment finds the main route that the majority of
trajectories follow, then detects all other deviations that trajectories can follow to
reach the same place. After that, the algorithm divides the results into spatial or
spatiotemporal outliers according to their nature. The next step will be the
interpretation of the detected outliers, since the semantic data allow a better
understanding of the detected behaviours.
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Towards The Register-Based Statistical System: 


A New Valuable Source for Population Studies 


Chieppa A., Ferrara R., Gallo G., Tomeo V.


Abstract 

The strong effort in micro-level integration among different statistical sources,
together with the availability of an increasing number of administrative archives, is
driving a major change in the processes that National Statistical Institutes adopt to
produce population statistics. The Italian National Statistical Institute, Istat, is
planning a new design for the next Census round, based on a convenient integration
of administrative data and surveys.

A thematic database has been created to study how administrative sources could
improve the quality and information content of population registers: the integrated
sources are the official municipal population registers and the Istat statistical
population register, together with administrative archives on the labour market,
education, income and taxation. The aim of this work is to point out how this
integration of data in proper registers could allow the discovery of new relevant
information about the population: clusters of individuals, determined by the patterns
that emerge when analyzing ‘signals of presence’ in different sources and their
geographical distribution, could be of great interest for population studies.
Moreover, emerging patterns could be very useful for designing population surveys
and producing population counts, both for defining statistical models to assess the
accuracy of population enumeration and for implementing sampling frames,
including those for the Population Census.

Key words: Permanent census, administrative data, Integrated microdata system, 
groups of population 
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For years, main sources for population studies and statistics have been demographic 
surveys and the Population Census on the one hand, and municipal population
registers on the other. In the past, integration among these sources took place at the
aggregate level: Census results were used to increase the accuracy of municipal
population registers; current demographic survey results were used to update Census
population counts to provide intercensal estimates or projections; and so on.
Nowadays, the strong effort in micro-level integration among different statistical
sources, together with the availability of an increasing number of administrative data
archives, is driving a major change in the processes that National Statistical
Institutes adopt to produce population statistics.

The Population Census remains the largest and most important statistical data
collection providing population figures for the smallest geographical units. While
most statistically advanced countries still use the traditional scheme, with complete
enumeration of population and housing units (e.g., the USA and Canada), an
increasing number of countries base their Census production on statistical registers.
A register-based Census can use register data exclusively, as in the Netherlands and
the Nordic countries (Sweden, Denmark, Finland and Norway), or can use a
combination of register and sample survey data within the frame of the so-called
‘combined Census’ (the Spanish 2011 Population Census, for example).

The Italian National Statistical Institute, Istat, is planning a combined Census
scheme for the next Census round, by conveniently carrying on the integration of
administrative data and surveys and then exploiting this new wealth of information.
During the last decade, Istat has been increasingly adopting administrative sources
for statistical purposes. Municipal population registers were the first administrative
microdata sources used in the field of population statistics: since 2011, Istat has
been collecting individual and household archives from this source every year. These
data were used to define the primary list of household respondents for the 2011
Census. Currently, variables collected from population registers (citizenship, age,
gender and place of residence) are used for the sampling of social surveys (e.g., the
Labour Force, Living Conditions, EU-SILC and Consumer Expenditure surveys).

Starting with Population Census microdata and adding vital events (births, deaths,
internal and international migrations), Istat has been building a statistical
population register, the so-called “ANagrafe VIrtuale Statistica” (ANVIS). This
register ensures a higher level of quality than the municipal administrative
population registers and represents a solid component of a register-based framework
for producing population statistics.

Since 2015, many experiments at Istat have been investigating the use of other
administrative sources to increase the quality and information content of population
registers. To manage the increasing number of administrative data sets and to
maximize the benefit, Istat built an integrated system of available administrative
sources, named SIM (Integrated System of Microdata). When a new administrative
archive is loaded into this system, recognition processes identify every individual or
economic unit present in the data and assign it a permanent and unique identification
number (ID): if the unit is already present in Istat databases, this ID is the same one
the unit was assigned in the past. ID assignment and the creation of proper links
among archives coming from different sources are the ‘core’ of microdata
integration. Starting from this base, it is then possible to construct specific data
structures for statistical processes and to create thematic databases (di Bella and
Ambroselli, 2014).
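
The ID assignment logic described above (a permanent, unique ID that is reused whenever a unit is recognised again) can be illustrated with a minimal sketch. It is not a description of the SIM implementation: the class name, the use of a single recognition key per unit and the sample keys are assumptions, and the real recognition process is certainly more elaborate than an exact key match.

```python
import itertools


class IdRegistry:
    """Assigns permanent IDs; a unit seen before always gets its old ID."""

    def __init__(self):
        self._ids = {}                       # recognition key -> permanent ID
        self._counter = itertools.count(1)   # source of new IDs

    def assign(self, key):
        """Return the existing ID for a known unit, or create a new one."""
        if key not in self._ids:
            self._ids[key] = next(self._counter)
        return self._ids[key]

    def load_archive(self, records, key_field):
        """Attach an ID to every record of a newly loaded archive."""
        return [(self.assign(rec[key_field]), rec) for rec in records]


# Hypothetical archives sharing one unit (key "AAA").
registry = IdRegistry()
labour = registry.load_archive([{"code": "AAA"}, {"code": "BBB"}], "code")
education = registry.load_archive([{"code": "AAA"}, {"code": "CCC"}], "code")
```

Because the unit with key "AAA" receives the same ID in both archives, its records can be linked across sources, which is the property that the thematic databases rely on.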

Among the archives loaded, the SIM comprises data from municipal population
registers and ANVIS, residence permits, data on employees and self-employed
workers, compulsory education students, university students, retired people,
non-pension benefit records, and individual data on income and taxation. These
integrated data have been used to create a thematic database to study how
administrative sources could improve the quality and information content of
population registers (Chieppa et al., 2016).

The aim of this work is to point out how this integration of data in proper registers
could allow the discovery of new relevant information about the population that
cannot be derived directly from any single archive: clusters of individuals,
determined by the patterns that emerge when analyzing ‘signals of presence’ in
different sources and their geographical distribution, could be of great interest for
population studies. Moreover, emerging patterns could be very useful for designing
population surveys and producing population counts, both for defining statistical
models to assess the accuracy of population enumeration and for implementing
sampling frames for population surveys, including the Population Census.

The process of knowledge discovery needed to extract useful information requires
expertise in the domains of the specific administrative sources, together with
expertise in population studies and technical skills, so a multidisciplinary team is
needed. This collaboration is essential, among other things, when selecting specific
administrative sources from all those available and identifying their hierarchy:
labor and education registers rank higher than the other sources, given that they
provide more detailed information on the territorial level and activity duration.

With the goal of discovering useful information to increase the quality of population
registers, signals of presence of individuals on the national territory are extracted
from the sources integrated in the database. These signals were investigated in depth
to identify the most discriminating attributes and possible latent dimensions. Once
the relevant features of the signals have been selected or derived (in the case of
latent ones), they are used to study specific clusters of individuals useful for the
purposes of the permanent Census.

The duration of the signals for each individual has proved to be of great importance,
together with their geographical distribution. Signals in administrative data can
represent a temporary or occasional presence, so it is necessary to carry out a
characterization process, by analysing the data and subsequently constructing
derived variables. Patterns of signal duration emerged: continuous and steady
signals, seasonal patterns, new entrances, and so on. This goal was achieved thanks
to the longitudinal perspective that the integration of microdata allows.
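
To make the idea of a duration pattern concrete, the sketch below labels an individual from the months in which a signal is observed in each year. The rules, the labels' exact definitions and the input layout are assumptions introduced for illustration; the text names the patterns but does not give the criteria actually used.

```python
ALL_MONTHS = set(range(1, 13))


def classify_signal(months_by_year):
    """Label a signal from {year: set of months (1..12) with a signal}.

    Hypothetical rules:
    - continuous:   a signal in every month of every observed year;
    - new entrance: nothing in earlier years, then an unbroken run of
                    months up to December of the last year;
    - seasonal:     the same partial set of months repeated in each year;
    - occasional:   anything else.
    """
    years = sorted(months_by_year)
    if all(months_by_year[y] == ALL_MONTHS for y in years):
        return "continuous"
    last = months_by_year[years[-1]]
    earlier_empty = all(not months_by_year[y] for y in years[:-1])
    if last and earlier_empty and last == set(range(min(last), 13)):
        return "new entrance"
    if len(years) > 1 and last and all(months_by_year[y] == last for y in years):
        return "seasonal"
    return "occasional"
```

For instance, a person with signals only in June, July and August of two consecutive years would be labelled "seasonal" under these rules.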

Signals can therefore be used to derive new relevant variables for the related
individuals and their living conditions. More specifically, with regard to population
counts, this new information could identify cases of permanent presence that
correspond to the definition and concept of usual residence in the international
regulations, which does not always correspond to what is recorded in the official
administrative population registers. Demographic variables, especially gender, age
and country of citizenship, as well as the location of the signal on the territory, have
proved to be very significant variables in defining specific sub-population profiles.
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A more detailed analysis of the characteristics of sub-populations at risk of over- 
and under-coverage in population registers has highlighted some important topics
and specific clusters of individuals. The division into different clusters reveals the
sub-populations that require further thematic analysis: for example, the typical
foreign communities that elude the registers but perform specific labor activities, or
people who frequent certain territories. These same groups can form the basis for
defining a census procedure built on mixed techniques that combine specific surveys
and appropriate statistical models.

The potential over-coverage of the population register amounts to 3 million
individuals: three out of four have Italian citizenship, and the foreigners are on
average more than six years younger than the Italians. This sub-population mainly
consists of people of working age (15-64 years) and is linked to the geographical
areas where unemployment is higher, such as the municipalities in the South and in
some central areas of Italy.

When considering under-coverage, the analysis shows the presence of several 
distinct clusters. The continuous signals showing stable presence on the territory are 
related to just over 400 thousand people who are mainly foreign nationals. 
Geographical location and specific citizenship are essential for identifying the cross- 
border workers, whose absence from the population register is admissible. 

The analysis also revealed that some individuals with weak (non-continuous) signals
may be associated with a stable presence on the territory, and for this reason an
improved characterization of signals is needed.
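
A minimal sketch of how register membership and signal patterns could be combined to flag coverage candidates is given below. The operational rules are assumptions made for illustration: the text does not state the exact criteria, so here registered individuals without any signal are treated as potential over-coverage, unregistered individuals with continuous signals as potential under-coverage, and unregistered individuals with weaker signals as cases for further review.

```python
def coverage_candidates(register_ids, signal_patterns):
    """Split individuals into coverage candidate groups.

    register_ids:    set of IDs present in the population register.
    signal_patterns: dict mapping ID -> pattern label, for every
                     individual with at least one signal of presence.
    """
    # Registered but never observed in the administrative sources.
    over = register_ids - set(signal_patterns)
    # Stable presence on the territory but absent from the register.
    under = {i for i, p in signal_patterns.items()
             if p == "continuous" and i not in register_ids}
    # Weak signals outside the register need further characterization.
    review = {i for i, p in signal_patterns.items()
              if p != "continuous" and i not in register_ids}
    return over, under, review


# Hypothetical IDs and patterns.
over, under, review = coverage_candidates(
    {1, 2, 3}, {2: "continuous", 4: "continuous", 5: "seasonal"})
```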

The clusters of individuals that emerged, which are relevant to assuring the quality of
population counts, could be used effectively to measure and improve the quality of
the Census survey, contributing to the sampling design and also to a proper
assessment of the quality of enumeration counts.
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Consulting, knowledge transfer and impact case 
studies of statistics in practice 


Consulenza, trasferimento della conoscenza e casi 
reali di impatto pratico della statistica 


Shirley Coleman


Abstract Companies are increasingly interested in learning how data science can
benefit them through analysis of their company data. This has led to a flurry of
requests for consultancy work and a corresponding shortage of people available to do
the work. Many companies value the stamp of a university, and highly qualified
business development staff are being employed by universities to bring in more
outside interest, partly to meet the impact case study requirements. This mismatch
can be problematic but does not detract from the satisfaction of having so many
enquiries. This talk will describe some of the interesting work currently under way
at Newcastle University with small and medium enterprises.

Abstract Negli ultimi anni si osserva da parte delle aziende un crescente 
interesse per come il Data Science può dare dei vantaggi nell’analisi dei loro dati 
aziendali. Tutto ciò ha portato ad una imponente richiesta di consulenza, ma con 
una conseguente carenza di persone competenti disponibili ad effettuare il lavoro. 
Molte aziende attribuiscono valore al marchio di una università, mentre le stesse 
università stanno impiegando personale altamente qualificato nello sviluppo 
commerciale per suscitare interesse all'esterno e in parte qualificare l'impatto dei 
loro casi di studio. Tale discrepanza potrebbe essere problematica, ma questo non 
riduce la soddisfazione di avere così tante richieste. Questa presentazione sarà 
dedicata ad alcuni interessanti progetti con imprese medio-piccole che sono 
attualmente in corso presso l’Università di Newcastle (UK). 

Key words: data science, data analytics, SMEs, business development 


1 Introduction 


The practical application of statistics is fundamental to the scientific approach to 
managing a business and running a successful industry. Although many companies 
have made use of statistics, many sectors are slow in the uptake and barriers in 
communication hinder a more complete realisation of the benefits. The extent of the
successes and missed opportunities varies across Europe, and it has been a great
advantage to have a pan-European network of practitioners who can share
experiences and learn from each other.

Statistics in practice is the main focus of the European Network of Business and 
Industrial Statistics (ENBIS) which was set up in 2000 as a network of statistical 
practitioners. The Industrial Statistics Research Unit (ISRU) in the School of 
Mathematics and Statistics at Newcastle University, UK was one of the founder 
members of ENBIS and has supported its activities ever since its launch and 
throughout its 17 years of annual conferences, subject-focused Spring meetings and 
other activities. 

ISRU is a self-financing consultancy unit dedicated to spreading the message of 
statistics in practice. The unit was set up by G. Barrie Wetherill in 1980 as a 
response to his growing involvement with process industries, in particular the 
massive ICI which employed over 20,000 people in nearby Teesside. ISRU evolved 
from providing consultancy in statistical process control (Wetherill and Brown 1992) 
and design of experiments to tackling the underlying issues of understanding the 
needs of industry and implementing the changes needed to realise the benefits of 
statistical interventions. The work involved methods from Total Quality 
Management and applying the Six Sigma approach. 

ISRU functions in a deprived economic area where heavy industries such as
shipping and mining have given way to light engineering and many start-up SMEs.
Typically these SMEs were slow to take advantage of innovation in terms of new
management methods, partly because of their lack of finance. The importance of
European funding was soon realised as a way to give SMEs the same chance of
improving the quality of their products and processes as that enjoyed by large
organisations. Several million pounds of support was obtained for helping companies
make use of statistics and take a more scientific approach to problem solving.
Amongst the EU projects that ISRU proposed was substantial funding to set up the
thematic network called pro-ENBIS. This initiative led to the growth of ENBIS into
the established and well-respected group of statistical practitioners that it is today.

In response to the increasing availability of data and awareness by companies of 
its importance, ISRU developed data mining offerings for business and industry and 
carried out several projects with SMEs part funded by the UK government. These 
included developing and implementing segmentation methods in retail, explorations 
of the Kansei Engineering approach (http://www.kansei.eu), statistical training in the 
healthcare sector and extensive data analytics in the gas industry. 

The interest in statistics and data mining is now escalating with many SMEs 
making serious enquiries as to what support is available for them. ISRU has the 
capability to help companies analyse and benefit from their data. There is now an 
issue of how to service these requests for consultancy services. University staff are
fully occupied with research and teaching and do not have the time or motivation to
carry out consultancy tasks. However, all UK universities are now required to
demonstrate the impact of their academic activities. Although enthusiasm for taking
on consultancy is lacking, the increasing importance of impact should help make it
more appealing.
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ISRU is in a pivotal position to help academics demonstrate impact. One major 
impact theme is around big data analytics for SMEs. The rest of this paper will focus 
on this theme. Section 2 describes SMEs, gives some background to big data and 
considers SME relationships with big data analytics, section 3 describes a case study 
from this impact work, section 4 discusses UK university impact requirements, 
section 5 considers the implications of the growth in data science for statisticians and 
for the practice of statistics, and section 6 is the conclusion. 


2 SMEs and their relationship with big data analytics 


Small to medium enterprises (SMEs) have fewer than 250 employees and either a
turnover of less than £40m or a balance sheet total of less than £34m. They are
considered to be the driving force of the economy, employing over half of EU
employees. Eurostat shows that the share of total value added by SMEs in 2014 was
around 60% on average across EU countries (http://ec.europa.eu/eurostat/). However,
it is noted that SMEs are behind in their commercial exploitation of big data:

“SMEs are lagging behind in the usage of business and big data analytics. In 
2012, the adoption rate of big data analytics among UK SMEs was only 0.2 per cent, 
compared to 25 per cent for businesses with over 1,000 employees (e-skills UK 
2013). Market studies expect an annual growth rate of the global SME big data 
market by 42 percent over the period of 2013 until 2018 (TechNavio 2014).”
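
To put the quoted growth rate in perspective, the short calculation below compounds a 42 percent annual rate over the period 2013 to 2018. Treating the period as five annual compounding steps is an assumption of this sketch; the quotation does not say how the rate was computed.

```python
def compound_growth_factor(annual_rate, years):
    """Factor by which a quantity grows at a constant annual rate."""
    return (1 + annual_rate) ** years


# Assumption: 2013 to 2018 is treated as five annual steps.
factor = compound_growth_factor(0.42, 2018 - 2013)
print(round(factor, 2))  # prints 5.77
```

Under this assumption the market would be nearly six times its 2013 size by 2018.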

Big data analytics here refers to the analysis of the masses of data regularly 
collected by businesses as part of their everyday activities. It includes data from 
sales, inventory, logistics, quality improvement, new product development and 
promotion. In particular where companies are concerned, the opportunities lie in the 
monetisation of company owned data in conjunction with secondary and open data 
(Coleman, 2016). 

Compared with larger organisations, SMEs have 


• Less money, available staff and access to consultants
• Greater concern with privacy, security and confidentiality
• And are more risk averse


The ENBIS community is keen to address this shortfall and is in the process of
composing a European Cooperation in Science and Technology (COST) funding bid,
http://www.cost.eu/. As background to the proposal, ENBIS members published a
paper reviewing the state of the art and the needs (Coleman et al 2016). One of the
findings reported in this paper is that SMEs face 14 main challenges, which are
concerned with skills, operational issues and practicalities. These are fully described
in the paper and in summary are:


• Skills: Lack of understanding of big data; Dominance of domain specialists;
  Cultural barriers and intrinsic conservatism; Shortage of in-house data science
  expertise; Bottlenecks in the labour market.
• Operational issues: Lack of business cases; Shortage of affordable consulting;
  Confusing software market; Lack of intuitive software; Lack of management
  and organisational models.
• Practicalities: Concerns on data security; Concerns about data protection and
  privacy; Over-focus on venture concept; Financial barriers.


Of these challenges, the ENBIS COST proposal will in particular address the 
lack of business cases; shortage of affordable consulting and financial barriers. There 
are currently few examples of data analytics case studies in SMEs. Ahlemeyer- 
Stubbe et al (2014) give recipes for dealing with company data but the focus is 
mainly on marketing issues. This lack of business cases will be ameliorated by 
making exemplars and case studies available. Consulting companies are disinclined 
to offer their services to SMEs as they prefer larger more lucrative contracts; the 
COST project will tackle this problem initially by clarifying what services are 
available, helping SMEs find assistance and facilitating the communication between 
SMEs and consultants, and rationalising the kinds of intervention that will be 
beneficial to SMEs. As more and more SMEs take up the big data opportunity,
consultants will find them a safer, more appealing prospect, and it is to be expected
that their services will become more affordable as the consultants see the advantage 
of working with SMEs. The financial barriers arise because of the intrinsically low 
cash flow of most SMEs but also because there are fewer avenues for debt finance. 
There is often an asymmetry of information between the SME and lender because it 
is difficult for the SME to express the benefits they expect from investing resources 
in big data analytics. The COST project will help clarify the costs and benefits and 
make financial applications easier to compose and more likely to be successful. 


3 Case study 


An established approach to helping SMEs and large organisations to innovate using 
big data analytics is funded knowledge transfer from universities to businesses. In 
the UK, Innovate UK funds 67% of the costs for an SME undertaking a knowledge 
transfer partnership (KTP) or 50% of the costs for a large organisation. These 
partnerships have to address a substantial, specific research area of need in the 
company and are usually of around 2 years duration. A graduate research associate is 
appointed to work full time in the business whilst being employed by the university 
and being mentored and supervised by an academic for at least 2 days per month. 
KTPs are a four-way partnership: the company stands to benefit from the work on
its problem without the risks of employing a new worker; the research associate
gains valuable work-based experience; the academic gains new material for
teaching and publishing; and Innovate UK shows added value for the funds it has
spent, in terms of increased turnover and profit for a UK company.
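
The funding split quoted above can be turned into a small worked calculation. The function name and the project cost of £100,000 are hypothetical; only the 67% and 50% funding rates come from the text.

```python
def company_contribution(total_cost, is_sme):
    """Share of KTP costs left to the company after Innovate UK funding.

    Innovate UK funds 67% of the costs for an SME and 50% for a large
    organisation; integer percentages avoid floating-point rounding.
    """
    funded_percent = 67 if is_sme else 50
    return total_cost * (100 - funded_percent) / 100


# Hypothetical project costing 100,000 pounds in total.
sme_share = company_contribution(100_000, is_sme=True)     # 33000.0
large_share = company_contribution(100_000, is_sme=False)  # 50000.0
```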

Newcastle University currently has 17 such KTPs including projects in 
agriculture, energy and historic records. KTPs were initiated in 1976 and over the 
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years ISRU staff have supervised projects in retail, finance, shipping, energy and 
assistive technology. One such KTP specializing in assistive technology was 
completed in 2016 and provides a good example of the work. Assistive technology 
refers to equipment and services that assist older people to deal with losing their 
ability to undertake activities of daily living. Activities of Daily Living (ADLs) 
include dressing, toileting, bathing, feeding and moving around. It is not always clear 
which assistive technologies will be the most appropriate for each individual, and the 
current socio-political-medical environment means that there are frequently long 
waiting periods to access expert opinion. It is quite common for products to have 
been developed specifically for an individual who has a particular need or 
requirement, then the product is mass-produced and marketed. However, bespoke 
products are not necessarily suitable for all who present with similar problems. 

The SME in this project had developed an expert system to enable people to 
interrogate information resources and find out which assistive technologies are 
available and are suitable for them in their physical environment. The company had 
14 years’ worth of assessment data resulting in several million cases with data that 
can be indexed by person, assessment question, ADL problem and ADL solution. 
The company was aware that their data is a rich source of insight and that more use 
could be made of it and so they were eager to explore what could be done. 

The KTP was designed to explore the use of data analytics to reveal insight about 
the ageing sector and to be of benefit to the many stakeholders of the company. As
the company was built around the needs of the public sector and its income was
dependent upon Government sources, an additional important aim of the data
analytics was to provide a new revenue stream for the company to make it more
stable and robust to economic changes.

The project produced guidelines to help SMEs get started with analysing their 
data: 


1. Review strategic objectives.
2. Review IT options for data manipulation, access, analysis and presentation.
3. Identify relevant dimensions of the data.
4. Carry out a stakeholder analysis.
5. Quantify the importance of each data dimension to each stakeholder.
6. Determine suitable revenue strategies.
7. Use data analytical methods to create insight.
8. Pilot and revisit step 1.
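
One way to carry out the step that quantifies the importance of each data dimension to each stakeholder is a simple scoring matrix, sketched below. The stakeholder names and the scores are invented for illustration and do not come from the project; the dimension names echo those by which the company's data can be indexed.

```python
# Hypothetical importance scores (1 = low, 5 = high) of each data
# dimension to each stakeholder.
importance = {
    "local authority": {"person": 2, "ADL problem": 5, "ADL solution": 3},
    "equipment supplier": {"person": 1, "ADL problem": 2, "ADL solution": 5},
}


def rank_dimensions(scores_by_stakeholder):
    """Sum the scores per dimension and sort from most to least important."""
    totals = {}
    for scores in scores_by_stakeholder.values():
        for dimension, score in scores.items():
            totals[dimension] = totals.get(dimension, 0) + score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_dimensions(importance)
```

The dimensions at the top of the ranking are natural candidates for the revenue strategies considered in the following step.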


Exemplars were prepared and the completed KTP was considered to be a great
success by all partners. In particular, the company has completely changed its
relationship with data and sees data analysis as a key company offering. In addition,
various stakeholders have committed to paying for insight, creating the desired new
revenue stream.
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4 Impact 


A considerable part of the funding of UK universities is based on impact. 
Universities need to show that their research has a broad and deep impact on society 
and the world at large. In the 2014 Research Excellence Framework, 25% of
non-student-related income depended upon impact. Each department had to show:


• Quantifiable reach and breadth of research impact in the last 5 years on a
  cohesive theme
• With several >2* underpinning publications over the last 20 years
• With 1 impact case study per 7-10 academic staff


ISRU and academic staff who undertake KTPs and consulting are in a good 
position to provide impact case studies. ISRU contributed one of 3 case studies 
submitted by the School of Mathematics and Statistics 
(http://www.ncl.ac.uk/research/ref/unit/uoal0) and staff are currently involved with 
preparing cases and helping other academics for the next research excellence 
framework expected to be in 2020. 

There is a growing mismatch between university administrative staff who want to 
maximise the university income and impact opportunities and academic staff who 
feel they are judged by their academic publications. The growing importance of 
impact may help to bridge this gap. Somehow the profile of staff involved with 
impact needs to be raised and the rewards made commensurate. 

Consultancy does not have to be carried out by a university group; however,
many businesses like to have the stamp of approval of a university as an independent
voice on their methods and conclusions.


5 Consultancy 


A Europe-wide initiative to work with big data and SMEs is just one approach
to introducing and eventually embedding statistics in companies. Universities and 
colleges have realised the need for more accessible training and there are an 
increasing number of Master’s and other postgraduate courses available for study. 
These are labelled variously as data analytics, business intelligence, data science and 
others. Most such courses offer a combination of statistical training, business 
awareness and IT skills (Coleman and Kenett 2016). 

Data science is a fast-growing concept that is becoming accepted in many walks
of life. It has attracted interest through the claims made on billboards and the fact
that even Governments are giving it extensive attention, for example the UK
Parliamentary Office of Science and Technology,
http://www.parliament.uk/mps-lords-and-offices/offices/bicameral/post/work-programme/big-data.
Citizen scientists are encouraged to collect data and contribute to major research
programs. For example, in the Zooniverse project, https://www.zooniverse.org/,
citizens are invited to take part in research.

Data science has caused a stir amongst professions such as statistics and 
operations research because many data scientists do not feel the need to be part of 
these professional societies. The Royal Statistical Society debated whether to include 
data science in all of its special interest sections or to start a new section devoted to 
data science. After considerable discussion it was decided to start a new data science 
section. The president strongly expressed the opinion that data science needs a sound 
basis in mathematics and statistics and that the professions are an intrinsic part of the 
new field of data science. 

In ISRU’s consultancy work we are teaming up with computer science specialists 
and with the client domain specialists so that we are addressing the three aspects of 
data science in a cohesive way. Computer scientist input includes data access and 
manipulation, machine learning methods and bespoke visualisation. Together we can 
produce tailored client solutions which can be implemented and developed within 
the company in a highly satisfactory manner. 

The data science consultancy needs of large organisations are often met by 
services provided as one small part of the offering from big consultancy firms who 
are more familiar with other business activities. There is a large body of statisticians 
working in government, health, finance and drug development but these statisticians 
and statistical groups have limited opportunities to help SMEs. The issue of where 
consultancy units are best placed was discussed in pro-ENBIS and a database of 
consultancy provision across Europe was one of the project deliverables. A 
comparison of academic based and commercial consultancy units was included in the 
project book (Coleman et al 2008) produced as part of the dissemination. This issue 
is still pertinent. Academic staff are judged by the quality and quantity of their 
research output. Research income is not always valued. We have found that it is 
difficult to attract academics to be involved in consultancy, but as stated above 
impact requirements may be the key to solving the issue. 


6 Conclusions 


Consultancy is a vital service, extracting added value from research and providing
real-life examples for students. Mechanisms are needed to facilitate the exchange, and
knowledge transfer partnerships are a good method. SMEs are increasingly motivated
to explore data analytic options for their company data. The current focus on impact
in UK universities is helpful in giving consultancy a higher profile and encouraging
more staff to service the need. Academic based consultancies have an important role
to play and need to be encouraged wherever possible. European funding is needed to
help SMEs to take part in the data analytics revolution. ENBIS is an important peer
group to co-ordinate such activities.


The evaluation of the inequality between
population subgroups


La valutazione della disuguaglianza tra i sottogruppi di 
una popolazione 


Michele Costa 


Abstract This paper illustrates the advantages of evaluating inequality between
population subgroups with respect to a maximum compatible with the observed data, thus
going beyond the traditional approach to the analysis of inequality between, where
the maximum corresponds to total inequality. The new proposal improves both the
measurement and the interpretation of the contribution of inequality between to total
inequality.

Abstract In questo lavoro vengono illustrati i vantaggi di valutare la disuguaglianza 
tra i gruppi di una popolazione rispetto a un massimo compatibile con i dati osser- 
vati, superando in questo modo l’approccio tradizionale alla misura della disug- 
uaglianza tra, nel quale il massimo viene rappresentato dalla disuguaglianza com- 
plessiva. La nuova proposta consente un miglioramento sia nella misura, sia nell’ 
interpretazione del contributo della disuguaglianza tra alla disuguaglianza totale. 


Key words: Inequality between, Inequality decomposition, Gini index, Inequality 
factors 


1 Introduction 


Inequality between population subgroups represents perhaps the most important
component of total inequality. By means of inequality between, different sources
of inequality are evaluated and compared, with the twofold goal of detecting the main
determinants of poverty and of implementing socio-economic policies able
to reduce it.

The measurement of inequality between can be achieved following different
approaches, since the inequality literature presents a wide collection of contributions on
inequality decomposition. However, the size of inequality between is usually
evaluated with respect to its theoretical maximum, which corresponds to total
inequality and is reached when the inequality within subgroups is equal to 0. The case
of null inequality within is a quite unrealistic situation, which can essentially be
considered a theoretical reference without a proper empirical counterpart. That is, we
do not really expect to observe a situation where each unit of each subgroup possesses
the subgroup mean.


Michele Costa
Department of Statistical Sciences, University of Bologna, e-mail: michele.costa@unibo.it


Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations.
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press

Furthermore, by comparing inequality between to total inequality, we can
observe two unfortunate effects. First, the size of inequality between is frequently
unreasonably small, thus suggesting too low an influence of the underlying
inequality factor. Second, the measure of inequality between is strongly influenced by the
number of subgroups used in the partition of the total population, thus
preventing a direct comparison between different inequality factors when the number of
subgroups is not the same.

In order to overcome these drawbacks, we propose a new framework for the
evaluation of inequality between, where the basis for comparison is not
total inequality but the maximum which can be obtained given the observed
data. We build on [4] and develop new indicators for the evaluation of inequality
between. The new indexes allow us to assess the importance of the different inequality
factors in the observed data, thus improving our knowledge of inequality.


2 Methodology 


In the following we will refer to the Gini index ([1],[5]) as our inequality measure
and to Dagum's decomposition to get the inequality between subgroups.

For the case of a population disaggregated into $k$ subgroups of size $n_j$, with
$\sum_{j=1}^{k} n_j = n$, the Gini index $G$ can be expressed as follows


$$G = \frac{1}{2 n^2 \bar{y}} \sum_{j=1}^{k} \sum_{h=1}^{k} \sum_{i=1}^{n_j} \sum_{r=1}^{n_h} |y_{ji} - y_{hr}| \qquad (1)$$


where $\bar{y}$ is the arithmetic mean of $Y$ in the overall population, $y_{ji}$ is the value of $Y$ in
the $i$-th unit of the $j$-th subgroup and, accordingly, $y_{hr}$ is the value of $Y$ in the $r$-th
unit of the $h$-th subgroup. Among the many methods which allow the Gini index to be
decomposed (see, e.g., [2],[6],[7]), we use the decomposition proposed by Dagum
[3], where the differences $|y_{ji} - y_{hr}|$ in (1) are assigned to $G_w$, the component of
inequality within subgroups, when $j = h$; to $G_b$, the component of inequality between
subgroups, when $j \neq h$, $\bar{y}_j > \bar{y}_h$, $y_{ji} \geq y_{hr}$; and to $G_t$, the component of overlapping,
when $j \neq h$, $\bar{y}_j > \bar{y}_h$, $y_{ji} < y_{hr}$. Globally we have $G = G_w + G_b + G_t$.
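
The assignment rule just described can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the authors' code: the function name `dagum_components` is hypothetical, subgroups are assumed to be non-empty lists of incomes, and ties between subgroup means are broken by position, which the text does not specify.

```python
def dagum_components(groups):
    """Return (G_w, G_b, G_t) for a list of subgroups, each a list of incomes.

    Every pairwise difference |y_ji - y_hr| of equation (1) is assigned to
    exactly one component, so the three values add up to the Gini index G.
    """
    n = sum(len(g) for g in groups)
    total = sum(sum(g) for g in groups)
    norm = 2.0 * n * total  # equals 2 * n^2 * (overall mean)
    means = [sum(g) / len(g) for g in groups]

    within = between = overlap = 0.0
    for j, group_j in enumerate(groups):
        for h, group_h in enumerate(groups):
            # Assumption: ties between subgroup means are broken by position.
            j_is_richer = (means[j], j) > (means[h], h)
            for y_ji in group_j:
                for y_hr in group_h:
                    diff = abs(y_ji - y_hr)
                    if j == h:
                        within += diff
                        continue
                    # Orient the pair so that "rich" comes from the subgroup
                    # with the higher mean.
                    rich, poor = (y_ji, y_hr) if j_is_richer else (y_hr, y_ji)
                    if rich >= poor:
                        between += diff
                    else:
                        overlap += diff
    return within / norm, between / norm, overlap / norm


# Toy data (not from the paper): two overlapping subgroups.
g_w, g_b, g_t = dagum_components([[1, 4], [2, 5]])
print(round(g_w, 4), round(g_b, 4), round(g_t, 4))  # 0.125 0.125 0.0417
```

In the toy example the pair (4, 2) is the only one where the unit of the poorer subgroup exceeds the unit of the richer one, so it is the only contribution to the overlapping component.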

In the framework of Dagum's decomposition, as well as following any other
approach, inequality between reaches its maximum when two conditions are
verified. First, the $k$ subgroups should not overlap, that is, the component $G_t$ is equal
to 0. Second, the variability within the subgroups should be equal to 0, that is,
the component $G_w$ is equal to 0 and each subgroup unit possesses the subgroup
mean: $y_{ji} = \bar{y}_j$, $j = 1, \ldots, k$; $i = 1, \ldots, n_j$. On the basis of these two conditions, we
have $G_{b\max} = G$ and the evaluation of $G_b$ is obtained by means of the ratio


$$I_{G_b} = G_b / G \qquad (2)$$


We propose to relax the condition $G_w = 0$ and to compare $G_b$ not to its theoretical
maximum $G$, but to the maximum $G_{b\max}$ which can be achieved given the observed
data. That is, we propose to compare $G_b$ not to the unrealistic case of equidistributed
subgroups, but to a case more coherent and compatible with the data.

If we maintain the condition of no overlapping, we have that $G_{b\max} = G - G_{w\min}$,
where $G_{w\min}$ is the minimum inequality within which can be obtained by partitioning
the observed data into $k$ non-overlapping subgroups. There are many ways to divide
$n$ units into $k$ non-overlapping subgroups: with the aim of preserving the structure
of the original partition, we propose two possible solutions. First, we obtain the $k$
subgroups by using the original $p_j = n_j/n$ values, thus keeping the same population
shares of the original partition. Second, we obtain the $k$ subgroups by using the
original $s_j = (n_j \bar{y}_j)/(n \bar{y})$ values, thus keeping the original income shares.

The second step of our method refers to the calculation of $G_{w\min}$, the minimum
inequality within compatible with the new $k$ subgroups. We propose to permute the
sequence of the $p_j$ (or $s_j$ for the second solution), to get a set of $k$ subgroups for each
permutation, to calculate the related $G_w$ and to choose the minimum value among
all available $G_w$. Let $G_{w\min(p)}$ be the minimum inequality within which can be
obtained by permuting the values $p_j$ and, correspondingly, $G_{w\min(s)}$ the minimum
inequality within which can be obtained by permuting the values $s_j$.

In the last step we derive the new indexes for the evaluation of $G_b$, obtained as


$$I_{G_b(p)} = G_b / (G - G_{w\min(p)}) = G_b / G_{b\max(p)} \qquad (3)$$


$$I_{G_b(s)} = G_b / (G - G_{w\min(s)}) = G_b / G_{b\max(s)} \qquad (4)$$


The new indexes depend on the minimum inequality within compatible with the
observed data and, therefore, are not as strongly affected by $k$ as $I_{G_b}$.
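
The three steps can be sketched as follows for the population-share variant. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the non-overlapping subgroups are built by sorting the incomes and cutting them into consecutive blocks whose sizes are the original $n_j$ taken in every possible order, and the income-share variant is not covered.

```python
from itertools import permutations


def sum_abs_diff(values):
    """Sum of |a - b| over all ordered pairs of values."""
    return sum(abs(a - b) for a in values for b in values)


def gw_min_population_shares(incomes, sizes):
    """Smallest G_w over non-overlapping partitions that reuse the sizes n_j.

    The incomes are sorted and cut into consecutive blocks, one block per
    subgroup size, for every distinct ordering of the sizes.
    """
    ordered = sorted(incomes)
    n = len(ordered)
    if sum(sizes) != n:
        raise ValueError("subgroup sizes must add up to the number of units")
    norm = 2.0 * n * sum(ordered)  # 2 * n^2 * (overall mean)
    best = float("inf")
    for sizes_perm in set(permutations(sizes)):
        start, within = 0, 0.0
        for size in sizes_perm:
            within += sum_abs_diff(ordered[start:start + size])
            start += size
        best = min(best, within / norm)
    return best


def index_gb_p(incomes, sizes, g_b):
    """G_b divided by (G - G_wmin(p)), following equation (3)."""
    n = len(incomes)
    gini = sum_abs_diff(incomes) / (2.0 * n * sum(incomes))
    return g_b / (gini - gw_min_population_shares(incomes, sizes))


# Toy data (not from the paper): subgroups [1, 4] and [2, 5], with G_b = 0.125.
incomes = [1, 4, 2, 5]
print(round(index_gb_p(incomes, [2, 2], g_b=0.125), 3))  # 0.5
```

On the toy data the traditional ratio $G_b/G$ is about 0.43, while the ratio against the attainable maximum is 0.5. The loop visits every ordering of the sizes, so its cost grows factorially with $k$; this is harmless for the small $k$ used here.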


3 Case study 


In order to illustrate the advantages of our proposal, we present a case study on
Italian households for 2014. The data are from the Survey on Household
Income and Wealth, a multidimensional survey on Italian households performed
every two years by the Bank of Italy. The study analyzes income inequality among
Italian households, divided into subgroups by means of two of the main
determinants of inequality: the area of residence of the household and the educational level
of the head of household. In order to evaluate the effect of the number of subgroups
on inequality between, we consider the cases $k = 2, 3, 5$.

Table 1 illustrates the income mean, the population share and the income share
for the two partitions. From Table 1 it is possible to observe some well-known
stylized facts of income inequality in Italy, clearly evident from the values $\bar{y}_j$ and from
the differences $(p_j - s_j)$.


Table 1 Mean income, population share (p) and income share (s) for Italian households divided by area
of residence and by educational level of the head of household, 2014


Area         Mean    p      s        Education      Mean    p     s

North West   33750   0.254  0.279    None           14676   0.03  0.02
North East   35150   0.221  0.229    Elementary     22329   0.20  0.16
Center       32636   0.202  0.226    Middle school  26753   0.37  0.31
South        23365   0.244  0.173    High school    35893   0.26  0.31
Islands      24095   0.081  0.093    University     46641   0.13  0.20


Our focus is on the effects of the differences between the subgroups on total
inequality. Table 2 illustrates Dagum's Gini index decomposition by area of
residence. By increasing $k$, we can observe the usual pattern in inequality
decomposition: the decrease of inequality within $G_w$ and the consequent greater importance
of inequality between $G_b$ and of the overlapping component $G_t$. The evaluation of $G_b$
on the basis of $I_{G_b}$ strictly depends on $k$: for $k = 2$ the area of
residence contributes 31% of total inequality, while for $k = 5$ its importance
rises to 48%. From Table 2 we can also observe how $I_{G_b(p)}$ and $I_{G_b(s)}$ are not
monotone functions of $k$, since they depend on the minimum inequality within. The
new indexes show quite similar results, with the contribution of the geographical
dimension ranging from 36% for $k = 2$ to 50-57% for $k = 5$.


Table 2 Income inequality decomposition by area of residence (a), Italian households 2014


Partition         k   G_w     G_b     G_t     I_{G_b}   I_{G_b(p)}   I_{G_b(s)}
NC, SI            2   0.194   0.107   0.049   0.306     0.355        0.359
N, C, SI          3   0.125   0.139   0.086   0.397     0.562        0.574
NW, NE, C, S, I   5   0.073   0.168   0.109   0.479     0.508        0.566


(a) N north, NW north-west, NE north-east, C center, S south, I islands.
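
As a quick arithmetic check, the traditional index of equation (2) can be recomputed from the three components reported in Table 2. The snippet below is only a sketch (the names `table2` and `traditional_index` are introduced here for illustration); small discrepancies are expected because the published components are rounded to three decimals.

```python
# (G_w, G_b, G_t, reported traditional index) copied from Table 2, keyed by k.
table2 = {
    2: (0.194, 0.107, 0.049, 0.306),
    3: (0.125, 0.139, 0.086, 0.397),
    5: (0.073, 0.168, 0.109, 0.479),
}


def traditional_index(g_w, g_b, g_t):
    """Equation (2): G_b divided by G = G_w + G_b + G_t."""
    return g_b / (g_w + g_b + g_t)


for k, (g_w, g_b, g_t, reported) in table2.items():
    recomputed = traditional_index(g_w, g_b, g_t)
    # Differences of about 0.001 come from rounding in the published table.
    print(k, round(recomputed, 3), reported)
```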


The results related to the decomposition by educational level of the head of
household are reported in Table 3. The components $G_w$, $G_b$, $G_t$ show a behavior
similar to the previous case; however, we can observe how $G_b$ has a greater
importance, while $G_t$ is smaller: two signals of a stronger relevance of the educational
dimension. $I_{G_b}$ confirms this indication, showing higher levels with respect to Table
2. The new indexes are also higher, but their increase with respect to the results of
Table 2 is less accentuated. Moreover, $I_{G_b(p)}$ and $I_{G_b(s)}$ show only slight variations with
changes of $k$, thus indicating an important degree of robustness in the evaluation
of $G_b$. The educational dimension is considered a more important inequality factor
than the geographical dimension by all indexes; however, with the new proposals
the difference between the two factors is not as large as on the basis of $I_{G_b}$. In both
cases the new indexes attribute a stronger role to the inequality factors,
overcoming the usual underestimation and reflecting more closely the effective importance of these
determinants of total inequality.


Table 3 Income inequality decomposition by educational level (a) of the head of household, Italian
households 2014


Partition       k   G_w     G_b     G_t     I_{G_b}   I_{G_b(p)}   I_{G_b(s)}
NEM, HU         2   0.162   0.149   0.038   0.426     0.590        0.626
NEM, H, U       3   0.130   0.171   0.049   0.487     0.613        0.568
N, E, M, H, U   5   0.081   0.200   0.069   0.570     0.615        0.613


(a) N none, E elementary, M middle school, H high school, U university.


4 Conclusions 


We propose to modify the traditional evaluation of the inequality between
population subgroups by introducing a maximum compatible with the observed data. Our
purpose is to assess the determinants of inequality with respect to the observed data,
and not by referring to the unrealistic case of equidistributed subgroups. Our
proposal also strongly reduces the effect of the number of subgroups on the
evaluation of inequality between. Two new indexes are illustrated and we believe
that their foundation on observed data represents an improvement in our
knowledge of the inequality structure.


Bayesian Non-Negative $\ell_1$-Regularised
Regression


Regressione LASSO Bayesiana non negativa 


Costola Michele 


Abstract This paper proposes a novel Bayesian approach to the problem of variable
selection and shrinkage in high-dimensional sparse non-negative linear regression
models. The regularisation method is an extension of the LASSO, which has been
recently cast in a Bayesian framework by Park and Casella (2008). Moreover, to deal
with the additional problem of variable selection we propose a Stochastic Search
Variable Selection (SSVS) method that relies on a Dirac spike-and-slab prior where
the slab component induces the sparse non-negative regularisation. The
methodology is then applied to the problem of passive tracking of large-dimensional
indexes in stock markets without short sales.

Abstract In questo lavoro introduciamo un nuovo metodo per la selezione sparsa
delle variabili in un modello di regressione lineare quando i parametri di regres- 
sione sono vincolati ad essere positivi. Il metodo di regolarizzazione impiegato è una 
estensione della metodologia LASSO recentemente estesa all'ambito bayesiano da 
Park and Casella (2008). L’obiettivo della selezione dei regressori viene raggiunto 
attraverso l’estensione del metodo Stochastic Search Variable Selection (SSVS) al 
caso di regressori non negativi con introducendo una distribuzione a priori di tipo 
spike-and-slab. La metodologia sviluppata è applicata al problema della repli- 
cazione passiva di un indice finanziario. 


Key words: Bayesian inference, Non-Negative Lasso, Sparsity, Spike and Slab
prior, Index Tracking.


1 Introduction


High-dimensional data analysis, dealing with models where the number of
parameters is larger than the sample size, is becoming one of the most important and active
research areas of statistics. Since the seminal paper of Tibshirani (1996), which
introduced the least absolute shrinkage and selection operator (LASSO), i.e., the first and
most popular method that can simultaneously perform parameter estimation and
selection in regression models, a number of relevant contributions have been proposed
with the same purpose of delivering sparse estimators in high dimensions. The least
angle regression (LARS) of Efron et al. (2004), the adaptive LASSO of Zou (2006)
and the group LASSO of Yuan and Lin (2006) are among the most important
shrinkage methods proposed in the last 20 years.

In this paper, we propose a novel Bayesian approach to the problem of
variable selection and shrinkage in high-dimensional sparse linear regression models
where regression coefficients are also subject to non-negativity constraints. The
regularisation method is an extension of the LASSO, which has been recently cast in a
Bayesian framework by Park and Casella (2008), Carvalho et al. (2010) and Hans
(2009). Moreover, since, as noted by Tibshirani (2011), the Bayesian LASSO of
Park and Casella (2008) based on the Laplace prior does not deliver sparse
estimates, we deal with the additional problem of variable selection using a
Stochastic Search Variable Selection (SSVS) method that relies on a Dirac spike-and-slab
prior where the slab component induces the sparse non-negative regularisation. The
methodology is then applied to the practical problem of passive tracking
of large-dimensional indexes in stock markets without short sales.


2 The Bayesian non-negative $\ell_1$-regularised regression


Let $y = (y_1, y_2, \ldots, y_T)'$ be the vector of observations on the scalar response variable
$Y$, let $X = (x_1', x_2', \ldots, x_T')'$ be the $(T \times p)$ matrix of observations on the $p$ covariates,
i.e., $x_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,p})$, and consider the following regression model

$$\pi(y \mid X, \alpha, \beta, \sigma_\varepsilon^2) = \mathcal{N}(y \mid \iota_T \alpha + X\beta, \sigma_\varepsilon^2) \qquad (1)$$

$$\pi(\alpha \mid \tau, \sigma_\varepsilon) = \mathcal{E}(\alpha \mid \tau, \sigma_\varepsilon) \qquad (2)$$

$$\pi(\beta \mid \tau, \sigma_\varepsilon) = \prod_{j=1}^{p} \mathcal{E}(\beta_j \mid \tau, \sigma_\varepsilon), \qquad (3)$$


where $\iota_T$ is the $T \times 1$ vector of unit elements, $\alpha \in \mathbb{R}$ denotes the parameter related
to the intercept of the model, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)'$ is the $p \times 1$ vector of regression
parameters and $\sigma_\varepsilon^2 \in \mathbb{R}^+$ is the scale parameter. We assume equations (2)-(3) as the
prior distributions on the constant and regression parameters, respectively, which
are assumed to be exponentially distributed


$$\mathcal{E}(x \mid \tau, \sigma_\varepsilon) = \frac{\tau}{\sigma_\varepsilon} \exp\left\{ -\frac{\tau}{\sigma_\varepsilon} x \right\} \mathbb{1}_{[0,\infty)}(x), \qquad (4)$$


where $\tau \in \mathbb{R}^+$ acts as the shrinkage parameter in the Lasso framework and $\sigma_\varepsilon$ is the
scale parameter. We want to project the regression parameters $\beta_* = (\alpha, \beta')'$ onto
the $\ell_1^+$ space, i.e., $\mathbb{1}_{O^+_{p+1}}(\beta_*)$, where $O^+_{p+1}$ denotes the positive orthant of dimension
$p+1$. From a Bayesian probabilistic point of view, this projection is equivalent to
introducing the truncated Laplace prior on $\beta_*$ defined in equation (3). The next
proposition introduces the $\ell_1$-regularised version of the static regression parameters.

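Proposition 1. Applying Bayes' theorem to the lasso regression model, the posterior
distribution is orthant-wise Normal


$$\pi(\beta_* \mid y, X, \sigma_\varepsilon, \tau) = \frac{\phi_{p+1}(\beta_* \mid \tilde\beta_*, \Sigma)}{\Phi_{p+1}(\tilde\beta_*, \Sigma)}\, \mathbb{1}_{O^+_{p+1}}(\beta_*), \qquad (5)$$


where $\Phi_{p+1}(m, S) = \int_{O^+_{p+1}} \mathcal{N}(t \mid m, S)\, dt$, and

$$\tilde\beta_* = \hat\beta_* - \tau \sigma_\varepsilon^{-1} \Sigma\, \iota_{p+1} \qquad (6)$$

$$\Sigma = \sigma_\varepsilon^2 (X_*' X_*)^{-1} \qquad (7)$$

$$\hat\beta_* = (X_*' X_*)^{-1} X_*' y \qquad (8)$$

$$X_* = [\iota_T \;\; X] \qquad (9)$$


with normalising constant

$$\Theta(y, X, \tau, \sigma_\varepsilon) = \int \pi(y \mid X, \alpha, \beta, \sigma_\varepsilon^2)\, \pi(\beta_* \mid \tau, \sigma_\varepsilon)\, \mathbb{1}_{O^+_{p+1}}(\beta_*)\, d\beta_* \qquad (10)$$

$$= \left( \frac{\tau}{\sigma_\varepsilon} \right)^{p+1} \frac{\Phi_{p+1}(\tilde\beta_*, \Sigma) \prod_{t=1}^{T} \phi(y_t \mid 0, \sigma_\varepsilon^2)}{\phi_{p+1}(0 \mid \tilde\beta_*, \Sigma)}, \qquad (11)$$


and $\iota_{p+1}$ is the $(p+1) \times 1$ vector of unit elements.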
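
The location shift in (6) comes from completing the square in the exponent of likelihood times prior. The short derivation below is a sketch added for the reader; it assumes that $X_*'X_*$ is invertible, a condition the text does not state explicitly.

```latex
\begin{align*}
\pi(\beta_* \mid y, X, \sigma_\varepsilon, \tau)
  &\propto \exp\Big\{ -\tfrac{1}{2\sigma_\varepsilon^2}(y - X_*\beta_*)'(y - X_*\beta_*)
           - \tfrac{\tau}{\sigma_\varepsilon}\,\iota_{p+1}'\beta_* \Big\}\,
           \mathbb{1}_{O^+_{p+1}}(\beta_*) \\
  &\propto \exp\Big\{ -\tfrac{1}{2}\Big[ \beta_*'\Sigma^{-1}\beta_*
           - 2\beta_*'\big(\sigma_\varepsilon^{-2}X_*'y
           - \tau\sigma_\varepsilon^{-1}\iota_{p+1}\big) \Big] \Big\}\,
           \mathbb{1}_{O^+_{p+1}}(\beta_*),
           \qquad \Sigma^{-1} = \sigma_\varepsilon^{-2}X_*'X_*, \\
\tilde\beta_*
  &= \Sigma\big(\sigma_\varepsilon^{-2}X_*'y - \tau\sigma_\varepsilon^{-1}\iota_{p+1}\big)
   = \hat\beta_* - \tau\sigma_\varepsilon^{-1}\Sigma\,\iota_{p+1}.
\end{align*}
```

The exponential prior therefore only shifts the least squares location by a term proportional to $\tau$, while the covariance $\Sigma$ is the usual one; the restriction to the positive orthant is what makes the posterior a truncated Normal.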


2.1 Dirac Spike-and-slab prior 


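Using standard notation, let $\gamma$ be the $p$-vector where $\gamma_j = 1$ if the $j$-th covariate
is included as explanatory variable in the regression model and $\gamma_j = 0$ otherwise.
Assuming that $\gamma_j \sim \mathcal{B}er(\omega)$, the prior distribution for $\beta_j$, $j = 1, 2, \ldots, p$, can be
written as the mixture


$$\pi(\beta_j \mid \tau, \sigma_\varepsilon, \omega) = (1 - \omega)\, \delta_0(\beta_j) + \omega\, \mathcal{E}(\beta_j \mid \tau, \sigma_\varepsilon), \qquad (12)$$


where $\delta_0(\beta_j)$ is a point mass at zero. The regression model defined in equations
(1)-(3) with the spike-and-slab $\ell_1$ prior defined in equation (12) becomes


$$\pi(y \mid X, \alpha, \beta, \sigma_\varepsilon^2) = \mathcal{N}(y \mid \iota_T \alpha + X\beta, \sigma_\varepsilon^2) \qquad (13)$$

$$\pi(\beta \mid \tau, \sigma_\varepsilon, \omega) = \prod_{j=1}^{p} \left[ (1 - \omega)\, \delta_0(\beta_j) + \omega\, \mathcal{E}(\beta_j \mid \tau, \sigma_\varepsilon) \right], \qquad (14)$$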



while we retain the same prior defined in equation (2) for the parameter $\alpha$.
Under the spike-and-slab prior in equation (14), an iteration of the Gibbs sampling
algorithm cycles through the full conditional distributions $\beta_j \mid y, X, \alpha, \beta_{-j}, \tau, \sigma_\varepsilon^2, \omega$,
where $\beta_{-j}$ denotes the vector of regression parameters without the $j$-th element,
for $j = 1, 2, \ldots, p$. The next proposition provides the analytical expression for the
full conditional distribution of $\beta_j$, for $j = 1, 2, \ldots, p$.


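Proposition 2. Applying Bayes' theorem to the lasso regression model, the full
conditional distributions of the parameters $(\alpha, \beta)$ are


$$\pi(\alpha \mid y, X, \beta, \sigma_\varepsilon, \tau) \propto \mathcal{N}\left( \alpha \mid \tilde\alpha, \frac{\sigma_\varepsilon^2}{T} \right) \mathbb{1}_{[0,\infty)}(\alpha), \qquad (15)$$


with $\tilde\alpha = \frac{1}{T} \sum_{t=1}^{T} (y_t - x_t'\beta) - \frac{\tau \sigma_\varepsilon}{T}$, and, for $j = 1, 2, \ldots, p$,


$$\pi(\beta_j \mid y, X, \alpha, \beta_{-j}, \sigma_\varepsilon^2, \tau, \omega) = \tilde\omega_j(y, X, \alpha, \beta_{-j}, \tau, \sigma_\varepsilon, \omega)\, \delta_0(\beta_j)
 + \left( 1 - \tilde\omega_j(y, X, \alpha, \beta_{-j}, \tau, \sigma_\varepsilon, \omega) \right) \frac{1}{\Phi(\tilde\beta_j / \tilde\sigma_j)}\, \mathcal{N}(\beta_j \mid \tilde\beta_j, \tilde\sigma_j^2)\, \mathbb{1}_{[0,\infty)}(\beta_j), \qquad (16)$$


where $\Phi(x) = \int_{-\infty}^{x} \phi(z)\, dz$ and

$$\tilde\beta_j = (x_j' x_j)^{-1} \left[ x_j'(y - \iota_T \alpha - X_{-j}\beta_{-j}) - \sigma_\varepsilon \tau \right] \qquad (17)$$

$$\tilde\sigma_j^2 = \sigma_\varepsilon^2 (x_j' x_j)^{-1}, \qquad (18)$$


with

$$\Theta_j^{1}(y, X, \alpha, \beta_{-j}, \tau, \sigma_\varepsilon, \omega) = \int \pi(y \mid X, \alpha, \beta, \sigma_\varepsilon^2)\, \omega\, \mathcal{E}(\beta_j \mid \tau, \sigma_\varepsilon)\, \mathbb{1}_{[0,\infty)}(\beta_j)\, d\beta_j
 = \omega\, \frac{\tau}{\sigma_\varepsilon}\, \frac{\Phi(\tilde\beta_j / \tilde\sigma_j)}{\phi(0 \mid \tilde\beta_j, \tilde\sigma_j^2)}\, \mathcal{N}(y \mid X_{*,-j}\beta_{*,-j}, \sigma_\varepsilon^2), \qquad (19)$$


and

$$\Theta_j^{0}(y, X, \alpha, \beta_{-j}, \tau, \sigma_\varepsilon, \omega) = \int \pi(y \mid X, \alpha, \beta, \sigma_\varepsilon^2)\, (1 - \omega)\, \delta_0(\beta_j)\, d\beta_j
 = (1 - \omega)\, \mathcal{N}(y \mid X_{*,-j}\beta_{*,-j}, \sigma_\varepsilon^2), \qquad (20)$$


and

$$\tilde\omega_j(y, X, \alpha, \beta_{-j}, \tau, \sigma_\varepsilon, \omega) = \left[ 1 + \frac{\omega\, \tau\, \Phi(\tilde\beta_j / \tilde\sigma_j)}{(1 - \omega)\, \sigma_\varepsilon\, \phi(0 \mid \tilde\beta_j, \tilde\sigma_j^2)} \right]^{-1}, \qquad (21)$$


where $X_{*,-j} = [\iota_T \;\; X_{-j}]$ and $\beta_{*,-j} = [\alpha \;\; \beta_{-j}']'$, for $j = 1, 2, \ldots, p$.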
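
A single draw of $\beta_j$ from a point-mass plus truncated-Normal mixture of the kind in equation (16) can be sketched as follows. This is an illustration and not the author's code: the function names are hypothetical, the weight on the point mass uses the odds form (prior odds times $\tau/\sigma_\varepsilon$ times the ratio of the Normal mass above zero to the Normal density at zero), which is an assumption about how the garbled formula should be read, and $0 < \omega < 1$ is assumed. The truncated Normal is sampled by inverting the distribution function, which is simple but loses accuracy far in the tails.

```python
import math
import random
from statistics import NormalDist


def conditional_moments(x_j, resid, sigma2, tau):
    """Location and scale of the slab part, following (17)-(18).

    resid is the partial residual y - iota * alpha - X_{-j} beta_{-j}.
    """
    xx = sum(v * v for v in x_j)
    xr = sum(v * r for v, r in zip(x_j, resid))
    beta_tilde = (xr - math.sqrt(sigma2) * tau) / xx
    sd_tilde = math.sqrt(sigma2 / xx)
    return beta_tilde, sd_tilde


def spike_probability(beta_tilde, sd_tilde, sigma2, tau, omega):
    """Conditional probability that beta_j = 0 (weight on the point mass)."""
    slab = NormalDist(beta_tilde, sd_tilde)
    phi0 = slab.pdf(0.0)
    if phi0 == 0.0:
        # Crude guard against underflow: far on the positive side the slab
        # dominates, far on the negative side the point mass does.
        return 0.0 if beta_tilde > 0 else 1.0
    mass_pos = 1.0 - slab.cdf(0.0)  # Normal mass above zero
    odds_slab = omega / (1.0 - omega) * tau / math.sqrt(sigma2) * mass_pos / phi0
    return 1.0 / (1.0 + odds_slab)


def draw_beta_j(x_j, resid, sigma2, tau, omega, rng):
    """One draw from the mixture: zero, or a Normal truncated to [0, inf)."""
    beta_tilde, sd_tilde = conditional_moments(x_j, resid, sigma2, tau)
    if rng.random() < spike_probability(beta_tilde, sd_tilde, sigma2, tau, omega):
        return 0.0
    slab = NormalDist(beta_tilde, sd_tilde)
    p0 = slab.cdf(0.0)
    u = p0 + rng.random() * (1.0 - p0)    # uniform over the mass above zero
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf inside (0, 1)
    return max(slab.inv_cdf(u), 0.0)


rng = random.Random(0)
# Hypothetical inputs: one covariate observed twice and its partial residuals.
draws = [draw_beta_j([1.0, 1.0], [1.0, 1.0], 1.0, 1.0, 0.5, rng) for _ in range(5)]
```

Every draw is either exactly zero or strictly positive, which is how the sampler delivers both variable selection and the non-negativity constraint.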


2.2 The Gibbs sampler 


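The scale parameter $\sigma_\varepsilon^2$ and the shrinkage parameter $\tau$, as well as the prior inclusion
probability $\omega$, are parameters that have to be estimated. Common choices for the
priors on those parameters are $\sigma_\varepsilon^2 \sim \mathcal{IG}(\sigma_\varepsilon^2 \mid \lambda_\sigma, \eta_\sigma)$, $\tau \sim \mathcal{G}(\tau \mid \lambda_\tau, \eta_\tau)$ and
$\omega \sim \mathcal{B}e(\omega \mid \lambda_\omega, \eta_\omega)$, where $(\lambda_\sigma, \eta_\sigma, \lambda_\tau, \eta_\tau, \lambda_\omega, \eta_\omega)$ are prior hyperparameters. Under
these priors the full conditional distribution of the scale parameter $\sigma_\varepsilon^2$ is


$$\pi(\sigma_\varepsilon^2 \mid y, X_\gamma, \beta_\gamma, \tau) \propto \mathcal{IG}(\sigma_\varepsilon^2 \mid \tilde\lambda_\sigma, \tilde\eta_\sigma)\, \exp\left\{ -\frac{\tau}{\sigma_\varepsilon} \sum_{j=1}^{n_\gamma} \beta_{\gamma,j} \right\} \mathbb{1}_{(0,\infty)}(\sigma_\varepsilon^2), \qquad (22)$$


where $n_\gamma$ is the number of nonzero elements of $\beta$, $X_\gamma$ is the $(T \times n_\gamma)$ matrix
collecting the observations on the variables included in the model, i.e., with $\gamma_j = 1$, $\beta_\gamma$
is the $(n_\gamma \times 1)$ vector of the included regression coefficients, $\tilde\lambda_\sigma = \lambda_\sigma + \frac{T + n_\gamma}{2}$, $\tilde\eta_\sigma = \eta_\sigma + \frac{1}{2} S_\gamma$
and $S_\gamma = (y - X_\gamma\beta_\gamma)'(y - X_\gamma\beta_\gamma)$. The full conditional distribution of the penalty
parameter $\tau$ is


$$\pi(\tau \mid y, X_\gamma, \alpha, \beta_\gamma, \sigma_\varepsilon, \omega) \propto \mathcal{G}(\tau \mid \tilde\lambda_\tau, \tilde\eta_\tau), \qquad (23)$$


with parameters $\tilde\lambda_\tau = \lambda_\tau + n_\gamma$ and $\tilde\eta_\tau = \eta_\tau + \sum_{j=1}^{n_\gamma} \beta_{\gamma,j} / \sigma_\varepsilon$. The full conditional distribution
of the sparsity parameter $\omega$ is

$$\pi(\omega \mid y, X_\gamma, \alpha, \beta_\gamma, \sigma_\varepsilon, \tau) = \mathcal{B}e(\omega \mid \tilde\lambda_\omega, \tilde\eta_\omega), \qquad (24)$$


with parameters $\tilde\lambda_\omega = \lambda_\omega + n_\gamma$ and $\tilde\eta_\omega = \eta_\omega + p - n_\gamma$.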

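The final algorithm for the linear regression model consists of choosing the initial
parameter values $(\alpha^{(0)}, \beta^{(0)}, \sigma_\varepsilon^{2(0)}, \tau^{(0)}, \omega^{(0)})$ and iteratively sampling
$(\alpha^{(k)}, \beta^{(k)}, \sigma_\varepsilon^{2(k)}, \tau^{(k)}, \omega^{(k)})$, for $k = 1, 2, \ldots$, from


(i) $\alpha^{(k)} \sim \pi(\alpha \mid y, X_\gamma, \beta^{(k-1)}, \sigma_\varepsilon^{2(k-1)}, \tau^{(k-1)}, \omega^{(k-1)})$ defined in equation (15);

(ii) $\beta_j^{(k)} \sim \pi(\beta_j \mid y, X, \alpha^{(k)}, \beta_{-j}^{(k-1)}, \sigma_\varepsilon^{2(k-1)}, \tau^{(k-1)}, \omega^{(k-1)})$ defined in equation (16),
for $j = 1, 2, \ldots, p$;

(iii) $\sigma_\varepsilon^{2(k)} \sim \pi(\sigma_\varepsilon^2 \mid y, X_\gamma, \alpha^{(k)}, \beta^{(k)}, \tau^{(k-1)}, \omega^{(k-1)})$ defined in equation (22);

(iv) $\tau^{(k)} \sim \pi(\tau \mid y, X_\gamma, \alpha^{(k)}, \beta^{(k)}, \sigma_\varepsilon^{2(k)}, \omega^{(k-1)})$ defined in equation (23);

(v) $\omega^{(k)} \sim \pi(\omega \mid y, X_\gamma, \alpha^{(k)}, \beta^{(k)}, \sigma_\varepsilon^{2(k)}, \tau^{(k)})$ defined in equation (24).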
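
The last step of the sweep is the simplest to illustrate, since the update in equation (24) is conjugate. The sketch below is not the author's code; the function name `draw_omega` and the example values are hypothetical, and the hyperparameters are read as the two shape parameters of the Beta prior.

```python
import random


def draw_omega(gamma, lam_omega, eta_omega, rng):
    """One draw from Be(lam_omega + n_gamma, eta_omega + p - n_gamma), as in (24)."""
    n_gamma = sum(gamma)  # number of covariates currently included
    p = len(gamma)
    return rng.betavariate(lam_omega + n_gamma, eta_omega + p - n_gamma)


rng = random.Random(1)
gamma = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # hypothetical inclusion indicators
omega = draw_omega(gamma, 1.0, 1.0, rng)
```

With a uniform prior and two covariates included out of ten, the draws concentrate around $3/12 = 0.25$, so a sparse current model pulls the inclusion probability down for the next sweep.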


3 Application to index tracking 


In this section, we focus on the application of non-negative Bayesian regularised
regression in financial modelling. The variable selection performance of the
non-negative regularised regression is tested by applying the method to index
tracking. We use real data from the DJIA index. Index tracking is a quantitative
passive trading scheme which aims to replicate the returns of a given portfolio of
assets over a certain time horizon by picking the portfolio constituents which are
most correlated with the portfolio returns. Index tracking, which attempts to match
the performance of an index as closely as possible, is one of the most popular topics in
statistical finance. Two main reasons justify the use of non-negative Bayesian
regularised regression for index tracking. First, statistical modelling of large-dimensional
indexes is a typical high-dimensional problem where the number of regressors $p$ is
usually larger than the number of observations. Second, for cost reasons, the
optimal replicating portfolio should match the performance of the entire index by
choosing the smallest possible subset of target stocks. Bayesian non-negative regression with a
spike-and-slab prior leads to sparse estimates of the regression
parameters, and hence to tracking solutions where only a few regressors are nonzero.


3.1 Data and results 


Fig. 1 Replication results of the DJIA 30 index using a window of dimension W = 24 weekly
observations (about 6 months). (Top left panel) true value of the DJIA index (red line),
along with the replicating portfolio (blue line). (Top right panel) posterior inclusion probabilities
of each regressor averaged over all the subsampling periods. (Bottom left panel) number of
non-zero regressors for each rolling window estimate. (Bottom right panel) Normal qq-plot of the
replicating portfolio.


Our data set consists of the end-of-week prices of the stocks in the DJIA 30 Index,
from 20 March 2008 to 10 March 2017 (the data come from the database). Weekly
prices are subsequently converted to weekly returns. For a price $P_t$, weekly returns
are defined as $r_t = P_t / P_{t-1} - 1$, for $t = 1, 2, \ldots, T$. Let $x_{i,t} = r_{i,t}$, for $i = 1, 2, \ldots, 30$,
represent the returns of the $i$-th constituent stock and $y_t = r_t$ represent the return of
the index. Then we can describe the relationship between $x_{i,t}$ and $y_t$ by the linear
regression model defined in (1), where $\beta$ is sparse since partial replication for index
tracking selects only a small subset of the constituents. The Bayesian non-negative
regularisation is then repeatedly applied to estimate the regression parameters $\beta$
using a rolling window of $W = 24$ observations over the entire sample
of observations, while retaining the last observation of each subsample for tracking
the index. Estimation results are reported in Figure 1. The top-left panel of Figure 1
compares the cumulative weekly log-returns over the whole sampling period
of the true index (red line) and the replicating index (blue line). It is evident from
the figure that the replicating index is quite close to the true index, indicating that our
Bayesian non-negative regularised regression method provides satisfactory results.
The bottom-left panel plots the number of assets used to replicate the index for each
estimation window. The average number of non-zero coefficients is less than half
of the number of components, indicating that our method replicates the index well
using a quite small subset of constituents.
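
The data preparation described above can be sketched as follows. The helper names are hypothetical, and one reading of the text is assumed: each window holds $W = 24$ observations, the first $W - 1$ are used for estimation and the last one is held out for tracking. The estimation step itself is left out.

```python
def simple_returns(prices):
    """r_t = P_t / P_{t-1} - 1 for t = 1, ..., T."""
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]


def rolling_windows(n_obs, width=24):
    """Yield (fit_indices, track_index) for each rolling window.

    Assumption: the first width - 1 observations of a window are used for
    estimation and its last observation is held out for tracking.
    """
    for start in range(n_obs - width + 1):
        end = start + width
        yield range(start, end - 1), end - 1


def tracking_return(weights, stock_returns_t):
    """Return of the replicating portfolio at time t: sum_i beta_i * x_{i,t}."""
    return sum(w * x for w, x in zip(weights, stock_returns_t))


# Hypothetical prices for a single stock, only to show the call pattern.
returns = simple_returns([100.0, 110.0, 99.0])
windows = list(rolling_windows(n_obs=26, width=24))
```

With 26 weekly returns and $W = 24$ the generator yields three windows, each shifted by one week, and the sparse non-negative weights estimated on a window would be passed to `tracking_return` together with the held-out constituent returns.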
4 Conclusion


We propose a Bayesian non-negative regularised regression that performs parameter
estimation and variable selection under a spike-and-slab $\ell_1$ prior via the Stochastic
Search Variable Selection algorithm in high-dimensional linear regression models
where regression coefficients are also constrained to be non-negative. We propose
an efficient Gibbs sampling algorithm for posterior simulation from the augmented
space of parameters and inclusion indexes. In the empirical application,
we use the Bayesian non-negative regularised regression to track the DJIA 30
index return by choosing a subset of its constituent stocks. We show that our
algorithm provides satisfactory results in replicating the index using a quite small
subset of constituents.


Industrial Production Index and the Web:
an explorative cointegration analysis.


L’indice della produzione industriale e il web: un’analisi 
esplorativa di cointegrazione. 


Lisa Crosato, Caterina Liberati, Paolo Mariani and Biancamaria Zavanella 


Abstract In this paper we explore the relationship between the Industrial Production
Index (IPI), the confidence index for the manufacturing sector and its sub-indexes,
and Google searches for several words linked to the economic situation, for the
period January 2004 - September 2016 on Italian data. Significant correlations
between the selected indicators point to probable comovements among them. Adding one
observation at a time, starting from the first warning signs of the 2008 crisis, we find that
a few Google searches and the IPI cointegrate, particularly during the strong
downward trend leading to January 2009, while no confidence indicators cointegrate with
the IPI. These findings suggest that concern about economic conditions expressed
through Google searches and the IPI or the confidence indexes are influenced by
common circumstances. Recursive forecasts of the IPI through VECM models
suggest that the evolution of the IPI can be well mimicked using the selected real-time
Google Trends variables.

Abstract In questo articolo vengono esplorate le interrelazioni fra l’indice della 
produzione industriale (IPI), l’indice di fiducia del settore manifatturiero e suoi 
sub-indici e le ricerche su google per tre parole legate alla situazione economica del 
Paese durante il periodo gennaio 2004 - settembre 2016. Le relazioni fra variabili 
significativamente correlate vengono approfondite tramite un’analisi della cointe- 
grazione sulle serie storiche degli indicatori a partire dai primi segnali della crisi 
del 2008. I risultati suggeriscono che la preoccupazione riguardo alle condizioni 
economiche del paese espressa tramite le ricerche online e l’indice della produzione 
o gli indicatori di fiducia delle imprese sono influenzati da circostanze comuni. Una 
previsione sequenziale dell’IPI conclude il lavoro. 


Key words: Industrial Production Index, Big Data, Google Trends, Confidence In- 
dicators, Cointegration 
1 Introduction


The Industrial Production Index is probably the main monthly indicator of
the current health of a country's economy. Accordingly, several contributions in the
literature have proposed models, from simple to complex, to forecast it, usually using hard
data as regressors, from macroeconomic variables to business-specific indicators
(Bodo and Signorini, 1987; Bruno and Lupi, 2004; Hassani et al., 2013). Soft data,
such as text analysis of media and other sentiment indicators, were introduced instead
by Ulbricht et al. (2016) to predict the German IPI with more than 17,000 models.

In this paper we intend to pursue a much simpler goal. The idea is to explore the
comovements of official statistics on industrial production and non-official
indicators built on Google searches for words related to the general and personal economic
situation. The main goal of the paper is to understand whether web-based soft index
numbers together with confidence indicators may help in predicting the hard IPI.
Our empirical strategy is to proceed by successive selection of variables, firstly by
simple visual inspection of the range of variability and secondly by analysing their
correlation with the IPI. If the correlations between the IPI and one or more soft
indicators are significant, one can check whether the relationship may also be
represented through time series modelling. So our third selection step
is to set aside the stationary indicators in order to proceed to the final cointegration
analysis. Finally we test for more than one cointegration relationship among
confidence indicators, Google searches and the IPI, to end up with VECM-based
short-term forecasts of the IPI. Our work is along the lines of Daas et al. (2014); however, this
paper differs in at least two aspects, besides the target variable: to begin with, we
analyse integration and cointegration between and among indicators in a recursive
fashion; moreover, we also forecast the IPI.


2 Data description 


This paper makes use of three data sources, two official and one non-official.
The first is the Industrial Production Index (IPI hereafter), released monthly by
ISTAT (Italian National Institute of Statistics) with a delay of two months with
respect to the reference period. The IPI is a Laspeyres index with fixed base 2010
and is the main short-term indicator measuring real output for all facilities
located in Italy.

The second data source is the Italian confidence index for manufacturing, released
monthly by ISTAT with a delay of about 15 days with respect to the interviews. In
particular, we have selected the opinions on the current level of orders, the
current economic situation, the future level of orders and the future economic
situation, together with the composite confidence indicator.

The third data source we use is Google Trends, a free tool by Google that allows
users to download the number of times a word or a sentence has been searched on
the Google and YouTube websites. The idea in using Google Trends is to build
statistical variables measuring the interest of the people of a country in specific
areas over time. The economic literature has been using Google Trends since its
appearance in 2004 (see Hassani and Silva (2015) for a recent review on forecasting
using Big Data). Google Trends data are released as monthly frequencies of searches
starting from January 2004; therefore this is the initial date for all our time
series. Since in this paper we are interested in understanding whether Google
searches can be used as proxies of the IPI, the words we have searched for in
Google Trends are related to the economic situation, especially to general concerns
about it. We searched for economic crisis, recovery, GDP, gross domestic product,
public debt, spread, recession, unemployment, employment, job. We also construct
naive composite Gtrends indicators by summing up the frequencies associated with
related words, so obtaining four more variables: Total cycle = economic crisis +
recession + recovery, Total occupation = unemployment + employment + job, Total
debt = public debt + spread, plus a mixed variable three words = economic crisis +
unemployment + public debt. The actual Italian words searched for are: crisi
economica, ripresa, PIL, prodotto interno lordo, spread, recessione,
disoccupazione, occupazione, lavoro.

The official statistics we use in the paper are all expressed as index numbers
with base 2010, so in order to have a fair comparison we have indexed the Gtrends
data to 2010. To this end, the monthly frequencies of the single and composite
words were divided by the mean of their respective 2010 frequencies.
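The rebasing step just described can be sketched as follows. This is only an illustration under stated assumptions: the function name `index_to_base_year`, the `(year, month)` keys and the toy frequencies are hypothetical and not taken from the paper, and whether the result is further multiplied by 100 (to match an index with 2010 = 100) is not stated in the text.

```python
from statistics import mean


def index_to_base_year(series, base_year=2010):
    """Divide every monthly value by the mean of the base-year values.

    `series` maps (year, month) tuples to search frequencies.
    """
    base_values = [v for (year, _), v in series.items() if year == base_year]
    if not base_values:
        raise ValueError(f"no observations for base year {base_year}")
    base_mean = mean(base_values)
    return {period: value / base_mean for period, value in series.items()}


# Hypothetical monthly frequencies for one search term (not real data).
toy = {(2009, 12): 30.0, (2010, 1): 40.0, (2010, 2): 60.0, (2011, 1): 75.0}
indexed = index_to_base_year(toy)
```

After rebasing, the mean of the 2010 values equals 1 by construction, so series with very different raw scales become comparable.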


3 Methodology and main results 


The time series from the three data sources differ in at least two aspects.
First, the IPI and the confidence indicators (when needed) are published already
seasonally adjusted, while the Gtrends variables must be treated for seasonality.
Therefore, we apply the R interface to the X-13ARIMA-SEATS method of the United
States Census Bureau. Second, they are released with different lags with respect to
the date of the information they refer to. At the end of each month the IPI of two
months earlier is available, while confidence indicators and Gtrends variables
refer to the current month. Accordingly, we shape the data matrix anticipating all
confidence and Gtrends indicators by two months. All the time series thus obtained
are shown in figure 1. A quick glance at the series reveals different degrees of
variability, highlighting the structural difference among the indicators. The
flattest series is clearly the IPI, followed by the confidence indicators and the
Gtrends variables. Gtrends variables are clearly more volatile and subject to
sudden jumps in correspondence with particular events (for instance, see in
figure 1 the spikes in economic crisis from spring 2008 onwards and in three words
at the end of the Berlusconi Government in summer-fall 2011).

The final aim of the paper is to explore whether Gtrends variables and confidence
indicators show some predictive power for the IPI. We intend to do this with a
multivariate time series model (a VAR, or a VECM if any cointegration relationship
appears). In particular, we adopt a forward approach, adding one observation at a
time from April 2008 onwards to monitor changes in the cointegration relationship
during the observed period. Thus, the length of the series increases by one at each
step, from 52 to 151 observations.

Fig. 1 Time series of the selected indicators. Sources: ISTAT official statistics
and our own elaborations on Google Trends data.
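The forward scheme can be pictured as a sequence of expanding windows. The sketch below only generates the windows; the helper name `expanding_windows` and the placeholder data are hypothetical, and the unit root, cointegration and VECM steps applied to each window in the paper are not reproduced here.

```python
def expanding_windows(series, first_length):
    """Yield series[:n] for n = first_length, ..., len(series)."""
    for n in range(first_length, len(series) + 1):
        yield series[:n]


# Placeholder for 151 monthly observations (January 2004 onwards).
observations = list(range(151))
windows = list(expanding_windows(observations, first_length=52))
```

With 151 observations and a first window of 52 this gives 100 nested samples, which matches the number of recursive estimates reported in the text.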

We proceed by successive selection of the initial variables, first eliminating
those indicators showing too wide a range of variation (spread, recession and
total debt). Secondly, we restrict the choice to variables which correlate with the
IPI. We thus eliminated all variables showing no significant correlation with the
IPI and, among the remaining ones, those with a correlation coefficient lower than
0.3. This way, we have discarded GDP, gross domestic product, total job and public
debt.

The third step in variable selection was to test for the presence of a unit root
in the series, as preliminary information for the cointegration analysis. We
performed the Phillips-Perron test for unit root on the whole set of 100 × 12
series. The p-values in figure 2, top panel, show that not all variables are
integrated but, most importantly, that the IPI is integrated at least from July
2008, together with all the confidence indexes. Among Gtrends variables, job is
never stationary, while three words and economic crisis are non-stationary only
over slightly different subperiods. Accordingly, we have also left behind
employment, unemployment, recovery and total cycle (see the remaining series in
figure 2, central panel). Now we move to the cointegration analysis of the IPI with
each of the remaining variables in turn (i.e. all the confidence indicators and
three words, job, economic crisis). The results of the Engle and Granger test for
cointegration, reported in figure 2 (p-values, bottom panel), point to no
cointegration either between the IPI and the confidence indexes or between the IPI
and job. On the contrary, the IPI and three words do cointegrate, and so do the IPI
and economic crisis, although there are some spikes when the turbulence in the two
Gtrends variables is higher. The cointegration analysis between confidence
indicators and three words reveals a similar outcome.

We think this can be viewed as a first result of the paper, contributing to the
definition of a selection strategy for the Gtrends variables with which to augment
forecasting models, although at present restricted to this particular case. If
variables cointegrate when shaped by a common factor or by a combination of common
factors, we may tentatively say that a few of the Gtrends variables and the IPI
share some common drivers.

Fig. 2 Unit root tests (top panel, p-values of the Phillips-Perron test), I(1)
series (central panel) and Engle and Granger cointegration test (bottom panel,
p-values of the Phillips-Perron test on residuals). Both tests are applied adding
one observation at a time from April 2008 onwards. Sources: ISTAT official
statistics and our own elaborations.

We conclude this exercise with a simple prediction based on a VECM estimated on
the IPI, the three words index and one confidence indicator in turn. This is a way
to measure the possible contribution of one or more confidence indicators to the
prediction of the IPI and to better exploit the pieces of information shared by
Gtrends variables and the confidence indicators, although none of the latter
cointegrates with the IPI. Again, the VECM was estimated 100 times on the time
series augmented month by month, resulting in 100 forecasted values of the IPI,
from May 2008 to September 2016 (see figure 3). The preliminary Johansen test
confirms a cointegration rank of one almost always for the confidence indicator on
future orders and, to a lesser extent, for the composite manufacturing confidence
index, while the introduction of the current orders or current production
indicators seems to weaken the cointegration relationship between three words and
the IPI. Therefore, the 100 IPI forecasted values in figure 3 are obtained through
VECMs based on the IPI, three words and the manufacturing confidence index or the
confidence in future production. As can be seen, predictions closely follow the
actual values of the production index, in downward as well as in upward changes.
The median absolute percentage error is smaller for the confidence in future
production (0.9%) than for the manufacturing composite confidence (1.1%), mainly
due to the protracted fall in the forecast for April 2009, when the IPI had already
turned up. Note that these predictions are available two months earlier than the
official IPI.


Fig. 3 Recursive forecast of the IPI by VECM models using one of the listed variables together 
with the IPI and the three words Gtrends variable. Sources: ISTAT official statistics and our own 
elaborations. 
Comparison of conditional tests on Poisson data 
Un confronto di test condizionati su dati di Poisson 


Francesca Romana Crucinio and Roberto Fontana 


Abstract We compare four conditional tests for Poisson data through a simulation
study: the exact binomial test, its asymptotic approximation, a Markov Chain Monte
Carlo test and the standard permutation test. We observe that permutation tests,
despite being non-parametric, are as effective as the others. From a theoretical
point of view we justify this result by observing that the orbits of permutations
form a good partition of the conditional space.

Abstract Si confrontano quattro test condizionati per dati di Poisson: il test bino- 
miale esatto, la sua approssimazione asintotica, un test Markov Chain Monte Carlo 
e un test di permutazione standard. Si osserva che il test di permutazione, pur non 
parametrico, ha un comportamento simile agli altri. Una giustificazione teorica di 
questo risultato sta nell’osservare che le orbite di permutazione costituiscono una 
buona partizione dello spazio condizionato. 


Key words: Algebraic statistics, Conditional test, Permutation test, Poisson data 


1 Introduction 


We address the problem of comparing the means of two Poisson distributions with
unknown parameters $\lambda_i$, $i = 1, 2$. We consider two independent samples,
$Y^{(n_1)} = (Y_1, \dots, Y_{n_1})$ of size $n_1$ from Poisson($\lambda_1$) and
$Y^{(n_2)} = (Y_{n_1+1}, \dots, Y_{n_1+n_2})$ of size $n_2$ from
Poisson($\lambda_2$). Then we use the joint sample $Y = (Y^{(n_1)}, Y^{(n_2)})$ to
perform the test $H_0 : \lambda_1 = \lambda_2$ against
$H_1 : \lambda_1 \neq \lambda_2$.

The problem has been extensively studied in the literature. Among the several
testing procedures available to researchers, we consider conditional tests, i.e.
tests that are performed considering only samples $Y$ such that the sum $Y_+$ of
their elements is equal to the sum $y_{obs,+}$ of the elements of the observed
sample $y_{obs}$,

$$Y_+ = \sum_{i=1}^{n_1+n_2} Y_i = \sum_{i=1}^{n_1+n_2} y_{i,obs} = y_{obs,+}. \qquad (1)$$


A justification for this choice is that, if we assume that the model for the means
of the two distributions is the standard one-way ANOVA model, which according to
[6] is $\log(\lambda_i) = \beta_0 + \beta_1 x_i$ with $x_i = 1$ if
$1 \le i \le n_1$ and $x_i = -1$ if $n_1 + 1 \le i \le n_1 + n_2$, the statistic
$T = Y_+ = \sum_{i=1}^{n_1+n_2} Y_i$ is sufficient for the population constant
$\beta_0$, which is the nuisance parameter of the test.

For the sake of simplicity we denote the sum of the observed sample $y_{obs,+}$ by
$t$ and the set of the samples $Y$ which satisfy (1) by $\mathcal{F}_t$. We refer to
$\mathcal{F}_t$ as the fiber corresponding to $t$. We focus on four conditional
tests:


1. the exact binomial test by Przyborowski and Wilenski [8]; 

2. an asymptotic version of the exact binomial test [8], which is based on the normal 
approximation of the binomial distribution [4]; 

3. a Markov Chain Monte Carlo testing procedure which exploits Markov basis [3] 
and the Metropolis-Hastings algorithm [9]; 

4. a standard permutation test [7]. 


In Section 2 we briefly describe the structure of the tests under study. In Section 3 
we compare the effectiveness of the tests through a simulation study and in Section 4 
we analyse the link between fibers and permutations from a theoretical perspective. 
Conclusions are in Section 5. 


2 Conditional Tests 


Exact and Asymptotic Conditional Binomial Test 


It is well known that the sum of $n$ independent Poisson variables of mean
$\lambda$ is a Poisson variable with mean $n\lambda$. Then it can be shown that the
distribution of the variable $T_1 | T = t$, i.e. of the variable
$T_1 = \sum_{i=1}^{n_1} Y_i$ conditioned on $T = \sum_{i=1}^{n_1+n_2} Y_i = t$, is a
binomial distribution with probability of success
$\theta = n_1\lambda_1 / (n_1\lambda_1 + n_2\lambda_2)$ and $t$ trials. It follows
that under $H_0 : \lambda_1 = \lambda_2$ the variable $T_1 | T = t$ follows a
binomial distribution with probability of success $\theta_0 = n_1 / (n_1 + n_2)$
and $t$ trials. If $t_1$ is the observed value of $T_1$, the p-value is computed as

$$\min\{2 \min\{p(T_1 \le t_1),\, p(T_1 \ge t_1)\},\, 1\} \qquad (2)$$

where $p(T_1 \le t_1) = \sum_{k=0}^{t_1} \binom{t}{k} \theta_0^k (1 - \theta_0)^{t-k}$
and $p(T_1 \ge t_1) = \sum_{k=t_1}^{t} \binom{t}{k} \theta_0^k (1 - \theta_0)^{t-k}$.
The asymptotic version of the conditional binomial test uses the asymptotic test
statistic

$$z = \frac{\hat{\theta} - \theta_0}{\sqrt{\theta_0 (1 - \theta_0) / t}} \sim N(0, 1) \quad \text{where } \hat{\theta} = T_1 / t.$$

The p-value is computed as $2 (1 - \Phi(|z_{obs}|))$, where $\Phi$ is the cumulative
distribution function of the standard normal variable and
$z_{obs} = (t_1 / t - \theta_0) / \sqrt{\theta_0 (1 - \theta_0) / t}$.
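Both p-values are straightforward to compute from the binomial distribution with $t$ trials and success probability $\theta_0 = n_1/(n_1+n_2)$. The sketch below is an illustration, not the authors' code: the function names are hypothetical, and the asymptotic version assumes that the proportion is estimated by $T_1/t$ (successes over the number of binomial trials) and that $t > 0$.

```python
from math import comb, erf, sqrt


def exact_binomial_pvalue(t1, t, n1, n2):
    """Two-sided exact conditional p-value, as in formula (2)."""
    theta0 = n1 / (n1 + n2)
    pmf = [comb(t, k) * theta0**k * (1 - theta0) ** (t - k) for k in range(t + 1)]
    lower = sum(pmf[: t1 + 1])  # p(T1 <= t1)
    upper = sum(pmf[t1:])       # p(T1 >= t1)
    return min(2 * min(lower, upper), 1.0)


def asymptotic_binomial_pvalue(t1, t, n1, n2):
    """Normal approximation of the conditional binomial test (requires t > 0)."""
    theta0 = n1 / (n1 + n2)
    z = (t1 / t - theta0) / sqrt(theta0 * (1 - theta0) / t)
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)
```

For instance, with equal sample sizes and an observed split of exactly half of the total count in the first sample, both functions return a p-value of 1, as expected for a perfectly balanced outcome.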


The Markov Chain Monte Carlo Test 


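As mentioned above we condition on the sum $t$ of the elements of the observed
sample $y_{obs}$ and we explore the fiber

$$\mathcal{F}_t = \Big\{ (Y_1, \dots, Y_{n_1+n_2}) \in \mathbb{N}^{n_1+n_2} : \sum_{i=1}^{n_1+n_2} Y_i = t \Big\}. \qquad (3)$$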


To explore the fiber $\mathcal{F}_t$ as defined in (3) we set up a connected Markov
chain by means of a Markov basis, i.e. a set $\mathcal{B}$ of moves which have to be
added to/subtracted from the vectors in $\mathcal{F}_t$ in order to move on the
fiber (see [3] for a formal definition of Markov basis). This basis can be found
using the 4ti2 software [10] or, in this specific case, simply by induction on the
sample size $N = n_1 + n_2$. We get that $\mathcal{B}$ is made of $N - 1$ moves
$m_u = (1, \delta_{1,u}, \dots, \delta_{N-1,u})$, $u = 1, \dots, N - 1$, where
$\delta_{a,b} = -1$ if $a = b$ and 0 otherwise. $\mathcal{B}$ allows us to build a
graph over the fiber, where each pair of vectors $y, x \in \mathcal{F}_t$ is linked
by an edge if a move $m \in \mathcal{B}$ exists such that $y = x \pm m$. An example
when $t = 6$ and $N = 3$ is shown in Figure 1.

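Under $H_0 : \lambda_1 = \lambda_2 = \lambda$ we exploit the Metropolis-Hastings
algorithm (an accelerated version as in [1], [2]) to modify the transition
probabilities and grant convergence to

$$p(y) = e^{-\lambda} \frac{\lambda^{y_1}}{y_1!} \cdots e^{-\lambda} \frac{\lambda^{y_N}}{y_N!} = e^{-N\lambda} \lambda^{t} \prod_{i=1}^{N} \frac{1}{y_i!} = C \prod_{i=1}^{N} \frac{1}{y_i!} \qquad (4)$$

where $C = e^{-N\lambda} \lambda^{t}$. At each step, if we are in state $y$, we
select a random move $m_u \in \mathcal{B}$ and we consider every possible transition
$y + \gamma \cdot m_u$ with
$\gamma \in \Gamma = \{\gamma \in \mathbb{Z} : y + \gamma \cdot m_u \in \mathcal{F}_t\} = [-y_1, y_{u+1}] \cap \mathbb{Z}$.
We move to $y + \gamma^* \cdot m_u$ with $\gamma^*$ randomly drawn from the set
above with probability

$$\frac{p(y + \gamma \cdot m_u)}{\sum_{\gamma \in \Gamma} p(y + \gamma \cdot m_u)} \propto \frac{1}{(y_1 + \gamma)!\,(y_{u+1} - \gamma)!}.$$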

This walk on $\mathcal{F}_t$ allows us to build an approximation of the
distribution, under $H_0$, of the test statistic
$W = \bar{Y}_1 - \bar{Y}_2 = T_1 / n_1 - T_2 / n_2$. Finally the p-value is computed
as

$$\frac{\#(|W| \ge |w_{obs}|)}{M} \qquad (5)$$

where $M$ is the number of transitions and $w_{obs}$ is the observed value of $W$.

Fig. 1: Graph on the fiber $\mathcal{F}_t$ with $t = 6$ and $N = 3$
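The walk on the fiber can be sketched in a few lines. This is a minimal illustration of the step described above (pick a move, then draw the step size with probability proportional to the target along that move), not the authors' implementation: the function names, the default run lengths and the seeding are assumptions, and the weights are computed naively, so the sketch is meant for small counts only.

```python
import random
from math import factorial


def mcmc_step(y, rng):
    """One transition y -> y + gamma * m_u on the fiber (the sum is preserved)."""
    u = rng.randrange(1, len(y))            # move m_u touches entries 0 and u
    gammas = list(range(-y[0], y[u] + 1))   # keeps both entries non-negative
    weights = [1 / (factorial(y[0] + g) * factorial(y[u] - g)) for g in gammas]
    g = rng.choices(gammas, weights=weights, k=1)[0]
    new_y = list(y)
    new_y[0] += g
    new_y[u] -= g
    return new_y


def mcmc_pvalue(y_obs, n1, n2, n_moves=10_000, burn_in=1_000, seed=0):
    """Share of visited states with |W| >= |w_obs|, as in formula (5)."""
    rng = random.Random(seed)

    def w(y):
        return sum(y[:n1]) / n1 - sum(y[n1:]) / n2

    w_obs = abs(w(y_obs))
    y = list(y_obs)
    for _ in range(burn_in):                # discard the burn-in transitions
        y = mcmc_step(y, rng)
    hits = 0
    for _ in range(n_moves):
        y = mcmc_step(y, rng)
        hits += abs(w(y)) >= w_obs
    return hits / n_moves
```

Every move adds to the first entry what it subtracts from another one, so each visited state stays on the fiber of the observed sample.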


Permutation Test 


We perform a standard permutation test [7], randomly selecting $M$ permutations of
$y_{obs}$ ($M$ is at least 1,000), computing the corresponding values of $W$ and the
p-value as in (5).
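A minimal sketch of this permutation test follows, assuming the same statistic $W$ and the same p-value definition as in (5). The function name, the default number of permutations and the seeding are illustrative choices, not taken from the paper.

```python
import random


def permutation_pvalue(y_obs, n1, n2, n_perm=2_000, seed=0):
    """Share of random permutations of y_obs with |W| >= |w_obs|."""
    rng = random.Random(seed)

    def w(y):
        return sum(y[:n1]) / n1 - sum(y[n1:]) / n2

    w_obs = abs(w(y_obs))
    y = list(y_obs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y)                      # random relabelling of the units
        hits += abs(w(y)) >= w_obs
    return hits / n_perm
```

Because a permutation only reorders the observed counts, every permuted sample has the same total as the observed one, so the test never leaves the fiber.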


3 Simulation Study 


We consider 27 scenarios that have been built taking three different sample sizes
$(n_1, n_2)$ (Table 1a) and, for each sample size, nine different population means
$(\lambda_1, \lambda_2)$ (Table 1b).

           1    2    3
n_1        3    8   35
n_2       17   12   15
(a) Sample sizes

           1     2     3     4     5     6     7     8     9
lambda_1   0.5   0.5   0.5   1     1     1     5     5     5
lambda_2   0.5   0.75  1     1     1.5   2     5     7.5   10
(b) Population means

For each scenario 1,000 samples have been randomly generated. For each sample the
corresponding p-values for the four testing procedures under study have been
computed. Specifically, for the MCMC test 10,000 moves after the 1,000 used for the
burn-in step have been used. For the permutation test 2,000 permutations have been
used.

We summarise the most important results: 


• the behaviour of the binomial tests (exact and asymptotic) looks different from
  the behaviour of the Monte Carlo tests (MCMC and permutation). This difference
  is due to the non-equivalent definitions of p-value ((2) and (5)) and, possibly,
  to the sampling of the fiber;

• the significance values achieved by the permutation test are almost equivalent to
  the ones achieved by the MCMC test, although the latter explores a much wider
  sample space. We discuss this point in Section 4.


4 Fiber and Permutation Sample Space 


The permutation operator does not alter the sum of entries. Hence the orbits of
permutations $\pi_y$, where $y$ is the generating vector, are subsets of the fiber.
The orbits do not intersect and then we can create a partition of $\mathcal{F}_t$
made up of part$(t, N)$ orbits $\pi_y$, where part$(t, N)$ is the partition function
defined in [5].

In the same orbit, $p(y)$ is constant and then the probability of taking a vector
of $\pi_y$ is
$p(\pi_y) = \sum_{y^* \in \pi_y} p(y^*) = \#\pi_y \cdot p(y) = \#\pi_y\, C \prod_{i=1}^{N} \frac{1}{y_i!}$,
where $\#\pi_y$ is the cardinality of $\pi_y$. It can be proved that $C$, the
normalizing constant defined in (4), can be computed as
$C = \Big( \sum_{\pi_y \subseteq \mathcal{F}_t} \#\pi_y \prod_{i=1}^{N} \frac{1}{y_i!} \Big)^{-1}$,
an expression that does not contain the unknown parameter
$\lambda = \lambda_1 = \lambda_2$.

As an example let us consider the fiber in Figure 1. It can be partitioned into
part(6, 3) = 7 orbits. We get $C = 80/81$ and we can compute the probability of each
orbit:

y          p(y)               #pi_y   p(pi_y)
(6,0,0)    80/(81·6!0!0!)     3       3/729
(5,1,0)    80/(81·5!1!0!)     6       36/729
(4,2,0)    80/(81·4!2!0!)     6       90/729
(3,3,0)    80/(81·3!3!0!)     3       60/729
(3,2,1)    80/(81·3!2!1!)     6       360/729
(4,1,1)    80/(81·4!1!1!)     3       90/729
(2,2,2)    80/(81·2!2!2!)     1       90/729
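The orbit probabilities in this example can be checked by brute-force enumeration. The sketch below uses exact rational arithmetic; the function names `fiber` and `orbit_probabilities` are hypothetical, and the enumeration is only practical for small $t$ and $N$.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial, prod


def fiber(t, n):
    """All vectors of n non-negative integers that sum to t."""
    return [y for y in product(range(t + 1), repeat=n) if sum(y) == t]


def orbit_probabilities(t, n):
    """Return the constant C and the probability of each permutation orbit."""
    vectors = fiber(t, n)

    def weight(y):
        return Fraction(1, prod(factorial(v) for v in y))

    c = 1 / sum(weight(y) for y in vectors)
    probs = Counter()
    for y in vectors:
        # An orbit is identified by its entries sorted in decreasing order.
        probs[tuple(sorted(y, reverse=True))] += c * weight(y)
    return c, dict(probs)


c, probs = orbit_probabilities(6, 3)
```

For $t = 6$ and $N = 3$ the enumeration visits 28 vectors grouped into 7 orbits and reproduces $C = 80/81$ together with the last column of the table, for example 360/729 for the orbit generated by (3,2,1).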
The partition of $\mathcal{F}_t$ into permutation orbits looks somehow optimal,
because we can approximate the fiber well with one orbit if its probability
$p(\pi_y)$ is large enough. This result is confirmed in Table 1. If we select
$n_1 = 2$ and $n_2 = 1$ and we compute the exact null cumulative distribution of $W$
over $\mathcal{F}_6$ and its approximation using the orbit $\pi_{(1,2,3)}$ (which has
the highest probability), we obtain two distributions which are considerably close,
even if the cardinality of the selected orbit is low ($\#\pi_{(1,2,3)} = 6$)
compared to the cardinality of $\mathcal{F}_6$, which is 28.


Table 1: Cumulative distribution of W on F_6 and pi_(1,2,3)


w             -6      -4.5    -3      -1.5    0       1.5     3
F_6           0.001   0.018   0.100   0.320   0.649   0.912   1
pi_(1,2,3)    0       0       0       0.333   0.667   1       1
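The two cumulative distributions can be recomputed exactly. The sketch below weights each vector by $\prod_i 1/y_i!$ and normalises over the set considered (the whole fiber or the single orbit); the names `w_stat` and `null_cdf` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial, prod


def w_stat(y, n1, n2):
    """W = T1/n1 - T2/n2 for a joint sample y of size n1 + n2."""
    return Fraction(sum(y[:n1]), n1) - Fraction(sum(y[n1:]), n2)


def null_cdf(vectors, n1, n2, grid):
    """P(W <= w) for each w in grid, with weights proportional to prod 1/y_i!."""
    weights = {y: Fraction(1, prod(factorial(v) for v in y)) for y in vectors}
    total = sum(weights.values())
    return [
        sum(wt for y, wt in weights.items() if w_stat(y, n1, n2) <= w) / total
        for w in grid
    ]


grid = [Fraction(k, 2) for k in (-12, -9, -6, -3, 0, 3, 6)]  # -6, -4.5, ..., 3
fiber_6 = [y for y in product(range(7), repeat=3) if sum(y) == 6]
orbit_123 = sorted(set(permutations((1, 2, 3))))
cdf_fiber = null_cdf(fiber_6, 2, 1, grid)
cdf_orbit = null_cdf(orbit_123, 2, 1, grid)
```

The exact values on the fiber are 1/729, 13/729, 73/729, 233/729, 473/729, 665/729 and 1, which round to the entries of the table, while the orbit gives 0, 0, 0, 1/3, 2/3, 1 and 1.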


5 Conclusion 


This study can easily be extended to the non-negative discrete distributions of the 
exponential family. The convergence of the MCMC to the exact binomial and a 
mathematical statement on the optimality of the partition of the fiber into orbits of 
permutations are part of our ongoing research. 
Non-parametric micro Statistical Matching 
techniques: some developments 


Tecniche micro non-parametriche per Statistical 
Matching: alcuni sviluppi 


Riccardo D’Alberto and Meri Raggi 


Abstract Sometimes, the integration of different data sources is the only suitable
solution to a shortage of microdata. Among the several data integration
methodologies, Statistical Matching (SM) imputation allows the integration of
different datasets when the same records are not uniquely identifiable through the
observed variables and/or beyond a modelled rescaling procedure from an observed
sample. In particular, non-parametric micro SM imputation (“hot deck”) techniques
allow researchers both to work always with observed (real) data and to avoid model
misspecification bias. Nevertheless, non-parametric methods still lack a proper
theoretical formalization and a sound methodology to evaluate the imputation
quality. Therefore, we propose new combinations of distance functions and “hot
deck” techniques, analysing how they perform in different donor-recipient dataset
scenarios and elaborating a robust, recursive strategy for the imputation
validation.

Abstract L’integrazione di diverse fonti di dati può risultare a volte la sola 
soluzione percorribile alla mancanza di microdati. Tra le molteplici metodologie, 
l’imputazione tramite Statistical Matching (SM) permette di integrare dataset dif- 
ferenti quando per gli stessi record non sono disponibili variabili identificative e/o 
al di là di un modello di re-scaling del campione osservato. Nello specifico, le tec- 
niche (“hot deck”) micro non-parametriche, permettono sia di lavorare sempre con 
dati osservati (reali) sia di evitare l’errore da misspecificazione del modello. La 
loro incompleta formalizzazione teorica e la mancanza di una strategia per la val- 
idazione della bontà dell’imputazione sono l’oggetto del presente lavoro. Infatti, 
proponiamo nuove combinazioni di tecniche “hot deck” e funzioni di distanza, anal- 
izzando le loro performance in differenti scenari ricevente-donatore ed elaborando 
una strategia per la validazione dell’imputazione. 


Key words: statistical matching, imputation, hot deck techniques 
1 Introduction 


Nowadays researchers are experiencing both a considerable increase in privately
collected data sources (e.g. big data, ad hoc project surveys, etc.) and a
simultaneous reduction of official administrative data sources. Nevertheless,
despite the wide range of possibilities these data offer, researchers cannot always
do without microdata, which are sometimes essential for the research. Their
shortage is due to several issues: entire official administrative data sources are
often unavailable for research purposes and/or hardly accessible, while their
release is governed by strict procedures which reduce the informative power of the
variables because of privacy restrictions. Moreover, the collection of the same
amount of information through ad hoc surveys is extremely expensive and
time-consuming.

A reliable solution, then, is to resort to data integration methodologies, among
which Statistical Matching (SM) imputation techniques (both parametric and
non-parametric) have gained considerable attention in recent years. They allow the
integration of different datasets when the same records are not identifiable
through a unique observed variable and/or beyond a modelled rescaling procedure
from an observed sample (as is instead the case when applying, respectively, Record
Linkage and the Statistical Up/Downscaling methodologies).

SM techniques have been properly formalized through a rigorous theoretical
framework by Rässler [4] and D’Orazio [2]. To the best of our knowledge, these two
works constitute the most relevant and complete references for the whole state of
the art in SM imputation. Nevertheless, while the parametric framework has been
carefully studied and developed, non-parametric methods still lack a proper
theoretical formalization and a sound methodology to evaluate the imputation
quality.

In this paper, we focus on non-parametric micro SM imputation (“hot deck”)
techniques since they allow researchers both to work always with observed (real)
data and to avoid model misspecification bias. Indeed, while the parametric approach
requires the specification of the variables’ distribution family and of an
imputation model, “hot deck” techniques “allow researchers to handle the missing
data issues by replacement” [3] from the most similar observed unit.

We explore new combinations of distance functions within the matching algorithms
of the “hot deck” techniques, analysing how these combinations perform in different
donor-recipient dataset scenarios and elaborating a robust, recursive strategy for
the imputation validation. In Sect. 2 we describe the state of the art in SM
imputation w.r.t. the non-parametric techniques and the proposed distance
functions, whereas in Sect. 3 we propose our developments and the results achieved
through a simulation study.




2 State of the art in SM imputation 


There are four non-parametric micro SM imputation techniques, i.e. the Nearest 
Neighbour Distance Hot Deck (nnd), the Constrained Nearest Neighbour Hot Deck 
(nndc), the Random hot deck (rnd) and the Rank hot deck (rnk) [2]. 

For the sake of simplicity, we define a basic imputation context of two different
datasets, the recipient (R) and the donor (D) ones. Let $i$ and $j$ be two different
units with $i = 1, \dots, n_R$ and $j = 1, \dots, n_D$. Defining
$X = \{X_l, l = 1, \dots, L\}$ the set of common variables between R and D, we have
that $X_l^R$ is a vector of dimension $(n_R \times 1)$ and $X_l^D$ is a vector of
dimension $(n_D \times 1)$.

Assuming that $L = 1$, so that $X$ is a single (continuous) variable, and defining
$i$ the recipient unit in R and $j^*$ the donor unit in D chosen to be matched (i.e.
to constitute a matching unit pair) among all the units $j$, from [2] we know that
the nnd technique associates unit pairs in such a way that the equation
$d_{ij^*} = \min_{1 \le j \le n_D} |x_i^R - x_j^D|$ holds, where $d$ is the
difference in absolute value between the two units $i$ and $j$ ($j^*$), always
computed such that $1 \le j \le n_D$.
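For a single continuous matching variable the nnd rule reduces to a nearest-value search. The sketch below is an illustration with hypothetical names and toy values; ties are resolved by taking the first closest donor, which is an assumption not discussed in the text.

```python
def nnd_match(recipient_x, donor_x):
    """For each recipient value, return the index j* of the closest donor value."""
    matches = []
    for x_r in recipient_x:
        j_star = min(range(len(donor_x)), key=lambda j: abs(x_r - donor_x[j]))
        matches.append(j_star)
    return matches


# Toy data (not from the paper): one matching variable X and one donor-only variable.
recipient_x = [1.0, 4.2, 7.9]
donor_x = [0.5, 4.0, 5.0, 8.5, 3.9]
donor_z = ["a", "b", "c", "d", "e"]

donors = nnd_match(recipient_x, donor_x)
imputed_z = [donor_z[j] for j in donors]   # values donated to the recipients
```

Note that a donor can be selected more than once under this rule, which is what the constrained variant (nndc) is designed to prevent.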

This technique can also be sharpened into the nndc technique by imposing a
constraint, i.e. by minimizing the function
$\sum_{i=1}^{n_R} \sum_{j=1}^{n_D} (d_{ij} \omega_{ij})$, where
$\omega_{ij} \in \{0, 1\}$ represents the matching unit pair of $i$ and $j$, such
that $\omega_{ij}$ is equal to 1 if they are matched and equal to 0 otherwise. Two
conditions have to hold in order to use each donor unit $j$ just once in the setting
up of a matching unit pair with a recipient unit $i$, i.e.:
$\sum_{j=1}^{n_D} \omega_{ij} = 1$ and $\sum_{i=1}^{n_R} \omega_{ij} \le 1$.

The rnd technique constitutes a matching unit pair by picking the donor units at
random. It can be sharpened in several ways but, for the sake of brevity, we present
only the easiest one, i.e. building donation classes. Indeed, if the usual possible
set of donor and recipient unit pairs is defined by $n_D^{n_R}$, it is possible to
define some homogeneous subsets within the chosen matching variables. Let $X_1$ and
$X_2$ be two existing common variables between R and D upon which we can constitute
a donation class; the possible set of unit pairs is then reduced to
$(n_D^{X_1})^{n_R^{X_1}} + (n_D^{X_2})^{n_R^{X_2}}$.

We stress that it is possible to build donation classes also w.r.t. the nnd and 
nndc techniques, improving the imputation quality but decreasing the computational 
speed. 

Finally, the rnk technique works in two recursive steps. Firstly, it ranks
recipient and donor units w.r.t. their empirical cumulative distribution functions
$\hat{F}_{X^R}(x) = \frac{1}{n_R} \sum_{i=1}^{n_R} I(x_i \le x)$ and
$\hat{F}_{X^D}(x) = \frac{1}{n_D} \sum_{j=1}^{n_D} I(x_j \le x)$, where $I(\cdot)$ is
equal to 1 when $x_i \le x$ (respectively $x_j \le x$) and to 0 otherwise.

Secondly, rnk associates to each recipient unit a donor unit in such a way that the
equation
$|\hat{F}_{X^R}(x_i^R) - \hat{F}_{X^D}(x_{j^*}^D)| = \min_{1 \le j \le n_D} |\hat{F}_{X^R}(x_i^R) - \hat{F}_{X^D}(x_j^D)|$
holds, where the minimum of the distance between $\hat{F}_{X^R}(x_i^R)$ and
$\hat{F}_{X^D}(x_j^D)$ is computed such that $1 \le j \le n_D$.
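The two steps of the rank technique can be sketched as follows, assuming the standard empirical cumulative distribution function. The helper names and the toy data are hypothetical, and ties are resolved by taking the first donor that attains the minimum.

```python
def ecdf(sample):
    """Return the empirical cumulative distribution function of `sample`."""
    n = len(sample)
    return lambda x: sum(1 for v in sample if v <= x) / n


def rnk_match(recipient_x, donor_x):
    """Match each recipient to the donor with the closest empirical CDF value."""
    f_r, f_d = ecdf(recipient_x), ecdf(donor_x)
    donor_ranks = [f_d(x) for x in donor_x]
    return [
        min(range(len(donor_x)), key=lambda j: abs(f_r(x) - donor_ranks[j]))
        for x in recipient_x
    ]


# Toy data on very different scales: only the ranks matter.
pairs = rnk_match([10, 20, 30], [1, 2, 3, 4, 5, 6])
```

Since the matching relies on ranks rather than on raw distances, the recipient at the first third of its distribution is paired with the donor at the first third of the donor distribution, regardless of the measurement scale.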

To the best of our knowledge, the nnd, nndc and rnd techniques work by applying a
default distance function (the Manhattan one) in their matching algorithms, whereas
not much is known either about their performance in different recipient-donor
dataset scenarios or about the quality of the imputation. We stress that what is
mainly known is derived from the parametric framework, as a sort of “prescription”
that we summarize as follows: i. when the dimensionality ratio between R and D is
equal, their variability is crucial, i.e. the condition in which the variance of the
matching variable(s) in R is lower than the variance of the matching variable(s) in
D is always preferable; ii. if, instead, the variance of the matching variable(s)
in R is higher than the variance of the matching variable(s) in D, the widest
dimensionality ratio between R and D is always preferable; iii. when the
dimensionality ratio between R and D differs, the key “assumption” is “the bigger,
the better”, i.e. the choice of the recipient and the donor datasets always has to
respect the condition $n_R < n_D$; iv. donation classes always benefit the
imputation quality.

Studying new combinations of “hot deck” techniques and the Manhattan (mn), 
Mahalanobis (ms) and Exact (e) distance functions (for all the details we refer to 
[1]) we formalize and validate the above-mentioned expectations, also proposing an 
imputation validation strategy. 


3 Our proposal: a simulation study 


The simulation study is based on two steps, a preliminary simulation of R and D
and a subsequent imputation. R and D are characterised in several scenarios w.r.t.
the different dimensionality ratio, the different variability of the matching
variables and the SM imputation running both with and without donation classes.

For both R and D we simulated two sets of common variables:
$X^A = \{X_1^A, X_2^A, X_3^A\}$ (used as matching variables) and
$X^B = \{X_1^B, X_2^B\}$ (used as imputation variables). The latter are simulated as
the realization of a log-Normal(1, $\sigma^2$) multiplied by a
Bernoulli($\theta$). $X_1^A$ is simulated as the realization of a
Bernoulli($\theta$); $X_2^A$ is a categorical variable indicating the main variable
value between $X_1^B$ and $X_2^B$; $X_3^A$ is simulated as the sum of the values of
the variables $X_1^B$ and $X_2^B$. The simulation scheme (for more details we refer
to [1]) is shown in Table 1.


Table 1 Simulation study and imputation scheme


Simulation Nr.   |        1        |        2        |        3        |        4        |
Ratio            |     1 to 10     |     1 to 10     |     1 to 3      |     1 to 3      |
Variability      | var(R) > var(D) | var(R) < var(D) | var(R) > var(D) | var(R) < var(D) |
Imputation Nr.   |   1   |    2    |   3   |    4    |   5   |    6    |   7   |    8    |
Donation classes | with  | without | with  | without | with  | without | with  | without |


We propose an imputation quality validation strategy based on three tools: i.
checking the distributions of both the variables originally present in R and those
imputed from D; ii. checking the distributions of the differences between the
values of the original variables in R and the values of the imputed variables from D
(we call these differences “z” variables); iii. evaluating the MSE of the “z”
variables.
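The third tool can be illustrated with a few lines of code. The sketch assumes that the MSE of a “z” variable is the mean of the squared differences between original and imputed values; the function name and the toy values are hypothetical.

```python
def z_mse(original, imputed):
    """Mean squared difference between original and imputed values."""
    if len(original) != len(imputed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    z = [o - i for o, i in zip(original, imputed)]   # the "z" variable
    return sum(d * d for d in z) / len(z)


mse = z_mse([1.0, 2.0, 3.0], [1.0, 4.0, 0.0])
```

In a simulation setting the original values are known for the recipient units as well, which is what makes this comparison possible.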
We refer to [1] for the detailed description of the whole set of simulation
results for each R-D scenario, obtained by applying each one of the proposed tools.
Here, we discuss the main simulation results only w.r.t. the evaluation of the MSE
of the “z” variables.

Firstly, a wider dimensionality ratio between R and D is decisive when the variance
of the matching variables in R is higher than the variance of the matching
variables in D, as Table 2 shows.


Table 2 MSE values of differences z (imputations 1, 2, 5, 6)


        |            donation classes             |           no donation classes           |
        |      1 to 10       |       1 to 3       |      1 to 10       |       1 to 3       |
        |  var(R) > var(D)   |  var(R) > var(D)   |  var(R) > var(D)   |  var(R) > var(D)   |
        |    Imputation 1    |    Imputation 5    |    Imputation 2    |    Imputation 6    |
        |    z_X1^B   z_X2^B |    z_X1^B   z_X2^B |    z_X1^B   z_X2^B |    z_X1^B   z_X2^B |

nnd.mn  |   101.536    9.617 |   102.534   10.017 |   176.171   83.896 |   182.890   90.273 |
nnd.ms  |   101.536    9.617 |   102.534   10.017 |   176.171   83.896 |   182.890   90.273 |
nnd.e   | 1,972.411  136.508 | 2,113.379  121.772 | 1,850.420  180.590 | 2,047.865  187.587 |
nndc.mn |   101.527    9.608 |   102.679   10.293 |   175.903   83.628 |   183.459   90.858 |
nndc.ms |   101.526    9.606 |   102.815   10.368 |   176.010   83.734 |   183.573   90.964 |
nndc.e  | 2,688.750  139.780 | 2,728.813  131.305 |   108.465   14.920 |   108.465   14.920 |
rnd.mn  | 1,000.011   15.570 | 1,186.610   19.674 | 1,253.199   85.351 | 1,192.059   73.047 |
rnd.ms  | 1,005.479   17.575 | 1,121.168   16.839 | 1,257.923   90.165 | 1,465.474  105.852 |
rnd.e   | 1,794.635  127.224 | 1,756.882  137.068 | 1,798.596  182.784 | 1,883.323  164.871 |
rnk     |   165.375   45.464 |   133.446   23.293 |   281.824  167.775 |   203.317   99.555 |


Secondly, for a given dimensionality ratio between R and D, a lower variance of the matching variables in R than in D is always decisive, as Table 3 shows. 


Table 3 MSE values of differences z (imputations 1, 2, 3, 4) 


        | donation classes                              | no donation classes
        | 1 to 10                                       | 1 to 10
        | var(R) > var(D)       | var(R) < var(D)       | var(R) > var(D)       | var(R) < var(D)
        | Imputation 1          | Imputation 3          | Imputation 2          | Imputation 4
        | ZyB       | ZyB       | ZyB       | ZyB       | ZyB       | ZyB       | ZyB       | ZxB

nnd.mn  |   101.536 |     9.617 |     9.532 |     9.528 |   176.171 |    83.896 |    77.918 |    77.904
nnd.ms  |   101.536 |     9.617 |     9.532 |     9.528 |   176.171 |    83.896 |    77.918 |    77.904
nnd.e   | 1,972.411 |   136.508 |   444.579 |   157.936 | 1,850.420 |   180.590 |   786.865 |   208.549
nndc.mn |   101.527 |     9.608 |     9.466 |     9.465 |   175.903 |    83.628 |    84.813 |    84.770
nndc.ms |   101.526 |     9.606 |     9.494 |     9.492 |   176.010 |    83.734 |    84.515 |    84.474
nndc.e  | 2,688.750 |   139.780 |   343.698 |   163.905 |   108.465 |    14.920 |    46.965 |    37.842
rnd.mn  | 1,000.011 |    15.570 |     8.273 |     7.295 | 1,253.199 |    85.351 |    78.321 |    81.351
rnd.ms  | 1,005.479 |    17.575 |     9.421 |     9.767 | 1,257.923 |    90.165 |    92.751 |    88.203
rnd.e   | 1,794.635 |   127.224 |   407.317 |    94.668 | 1,798.596 |   182.784 |   583.777 |   121.647
rnk     |   165.375 |    45.464 | 2,943.404 |    98.975 |   281.824 |   167.775 | 2,963.817 |   160.906
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Thirdly, we found evidence that a narrower dimensionality ratio between R and 
D, being the variance of the matching variables in R lower than the variance of 
the matching variables in D, can produce the best imputation results if the matching 
variables in D have a proper variability, as Table 4 shows. In other words, contrary to the common “the biggest, the best” prescription, the bond between R and D can be relaxed if the variance of the matching variable(s) in R is lower than the variance of the matching variable(s) in D, and if the variance of the matching variable(s) in the smaller of the two donor datasets is the widest one. 


Table 4 MSE values of differences z (imputations 3, 4, 7, 8) 


        | donation classes                              | no donation classes
        | 1 to 10               | 1 to 3                | 1 to 10               | 1 to 3
        | var(R) < var(D)                               | var(R) < var(D)
        | Imputation 3          | Imputation 7          | Imputation 4          | Imputation 8
        | xB        | xB        | ZxB       | xB        | xB        | xB        | ZxB       | xB

nnd.mn  |     9.532 |     9.528 |     7.872 |     7.945 |    77.918 |    77.904 |    87.838 |    88.045
nnd.ms  |     9.532 |     9.528 |     7.872 |     7.945 |    77.918 |    77.904 |    87.838 |    88.045
nnd.e   |   444.579 |   157.936 |   477.174 |   158.138 |   786.865 |   208.549 |   666.437 |   205.484
nndc.mn |     9.466 |     9.465 |     7.867 |     7.976 |    84.813 |    84.770 |    95.708 |    95.738
nndc.ms |     9.494 |     9.492 |     7.913 |     8.022 |    84.515 |    84.474 |    77.219 |    77.183
nndc.e  |   343.698 |   163.905 |   420.386 |   169.801 |    46.965 |    37.842 |    46.965 |    37.842
rnd.mn  |     8.273 |     7.295 |    12.321 |    16.484 |    78.321 |    81.351 |   104.761 |    99.260
rnd.ms  |     9.421 |     9.767 |     9.950 |    16.915 |    92.751 |    88.203 |    85.926 |    87.745
rnd.e   |   407.317 |    94.668 |   573.707 |   106.418 |   583.777 |   121.647 |   334.443 |    76.499
rnk     | 2,943.404 |    98.975 | 2,834.001 |    86.592 | 2,963.817 |   160.906 | 2,953.937 |   143.025


Therefore, all the “prescriptions” from the literature were tested and validated, with the remarkable exception of “the biggest, the best”. Contrary to what is commonly thought and prescribed, this condition is not mandatory and can be relaxed whenever either the variance of the matching variable(s) in R is lower than the variance of the matching variable(s) in D or, when comparing two potential donor datasets, the variance of the matching variable(s) in the smaller of the two is the widest. Further developments are already under study, aimed both at a deeper theoretical formalization of the proposed combinations and at a more structured and elaborate imputation quality validation strategy. 
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Measuring tourism from demand side 
La misura del turismo dal lato della domanda 


Stefano De Cantis, Mauro Ferrante, Anna Maria Parroco 


Abstract This paper proposes an analysis of tourism from the demand side, taking into account both the total level of tourism demand produced by some European countries (domestic and outgoing) and its general tendency, as well as the seasonal fluctuations which characterize many tourism-related aggregates. Tourist flows from the demand side at the European level are analyzed over the last decade, and a special focus on Italian tourism demand is provided, together with an analysis of its seasonal fluctuations. The analysis of the general tendency of tourism demand and of the impacts of seasonality is a fundamental prerequisite for the implementation of tourism policies. 

Abstract This article proposes an analysis of tourist flows from the demand side, examining both the overall level and general tendency of tourism demand and its seasonal pattern. To this end, data on tourism demand at the European level over the last decade are analyzed, and a specific focus on tourism demand in Italy is provided, with particular emphasis on its seasonal pattern. A better understanding of the dynamics of tourism demand and of the impacts of seasonality is an essential prerequisite for the implementation of tourism policies. 


Key words: Tourism Statistics, Tourist behavior, Seasonal pattern, Seasonal ampli- 
tude 
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1 Introduction 


Country-specific measurement and analysis of the tourism activity of residents (in terms of both physical and economic volumes) allow us to describe the determinants of consumer behavior, to investigate the motivations behind the choice of whether or not to travel, and finally to estimate new tendencies in tourism-related behaviors. In order to describe the main features of tourism demand at the country level, it is important to consider: a) the total level of tourism demand produced 
by each country (domestic and outgoing); b) the general tendency which, in the 
medium run, characterizes the demand; c) the seasonal fluctuations of tourism de- 
mand. In particular, seasonality plays an important role, since the same level of 
tourism demand could determine very different impacts according to its distribution 
over time. Starting from these premises, the present work aims at analyzing tourist 
flows from the demand side at the European level. In the European context, the Eu- 
ropean Regulation (EU) No 692/2011 [3] concerning European statistics on tourism 
aims at establishing a common framework in the European Union, concerning the 
collection of statistical information on tourism. The Regulation states that data to be 
transmitted by the Member States concerns: a) the participation in tourism and the 
characteristics of tourism trips and visitors, and b) the characteristics of same-day 
visits. Collected and integrated data is provided by Eurostat in the tourism section of 
the database available at Eurostat’s website. Moreover, a special focus is placed on the Italian case; thanks to the availability of micro-data on trips made by residents, the seasonal component of tourism demand is analyzed in detail and some synthetic measures of seasonal concentration are provided. 


2 Analysis of tourism demand in Europe 


In the European context, the level of tourism demand of residents in the European Union is largely determined by a relatively limited number of countries. Germany, France, the UK, Italy, and Spain represent a relevant share of the EU’s population and, apart from the Scandinavian countries, also have some of the highest rates of tourism propensity. Germany has the highest tourism demand, with more than 300 million trips made by residents in 2011, followed by France, the UK, Italy and Spain. 

In order to offer a wider perspective on tourism demand at the European level, Tab. 1 reports the Net Travel Propensity Index for a set of European countries in 2011, along with their number of trips and resident population values. The Scandinavian countries (Norway, Sweden, and Finland) have the highest Net Travel Propensity, but Germany, with a value of about 60% in the third quarter, also shows a high travel propensity. By contrast, in countries such as Bulgaria, Poland, Portugal, and Romania the highest value of the Net Travel Propensity Index does not exceed 25%. Moreover, although fairly similar in terms of population, the five selected countries (Germany, France, UK, Italy, and Spain) are very different in terms of travel propensity (Tab. 1). First, European countries 
Table 1 Quarterly Net Travel Propensity Index for selected European countries (data in thousands), 2011 

Country | Population | Quarterly number of resident travelers: Q1 | Q2 | Q3 | Q4 | Net Travel Propensity Index: Q1 | Q2 | Q3 | Q4
Bulgaria | 7,369 | 369 | 668 | 1,170 | 624 | 5.0% | 9.1% | 15.9% | 8.5%
Czech Republic | 10,487 | 3,627 | 4,487 | 4,596 | 2,802 | 34.6% | 42.8% | 43.8% | 26.7%
Denmark | 5,561 | 2,560 | 2,736 | 3,181 | 2,726 | 46.0% | 49.2% | 57.2% | 49.0%
Germany | 81,752 | 32,897 | 42,750 | 47,208 | 39,569 | 40.2% | 52.3% | 57.7% | 48.4%
Greece | 11,123 | 1,019 | 1,824 | 3,253 | 1,287 | 9.2% | 16.4% | 29.2% | 11.6%
Spain | 46,667 | 7,902 | 11,478 | 15,624 | 9,035 | 16.9% | 24.6% | 33.5% | 19.4%
France | 64,979 | 21,012 | 26,550 | 34,169 | 23,560 | 32.3% | 40.9% | 52.6% | 36.3%
Croatia | 4,290 | 755 | 1,013 | 1,521 | 955 | 17.6% | 23.6% | 35.5% | 22.3%
Italy | 59,365 | 9,575 | 10,140 | 21,095 | 8,113 | 16.1% | 17.1% | 35.5% | 13.7%
Latvia | 2,075 | 317 | 484 | 623 | 353 | 15.3% | 23.3% | 30.0% | 17.0%
Lithuania | 3,053 | 573 | 911 | 1,210 | 761 | 18.8% | 29.8% | 39.6% | 24.9%
Luxembourg | 512 | 214 | 267 | 318 | 228 | 41.9% | 52.2% | 62.1% | 44.5%
Hungary | 9,986 | 1,584 | 2,067 | 2,787 | 1,886 | 15.9% | 20.7% | 27.9% | 18.9%
Malta | 415 | 72 | 78 | 117 | 56 | 17.4% | 18.8% | 28.3% | 13.5%
Austria | 8,375 | 2,528 | 3,342 | 4,652 | 2,854 | 30.2% | 39.9% | 55.5% | 34.1%
Poland | 38,530 | 4,610 | 5,830 | 9,560 | 4,807 | 12.0% | 15.1% | 24.8% | 12.5%
Portugal | 10,573 | 1,017 | 1,423 | 2,508 | 1,641 | 9.6% | 13.5% | 23.7% | 15.5%
Romania | 20,199 | 2,123 | 2,918 | 3,348 | 3,212 | 10.5% | 14.4% | 16.6% | 15.9%
Slovenia | 2,050 | 447 | 650 | 1,079 | 476 | 21.8% | 31.7% | 52.6% | 23.2%
Slovakia | 5,392 | 1,374 | 1,450 | 2,375 | 1,423 | 25.5% | 26.9% | 44.0% | 26.4%
Finland | 5,375 | 3,026 | 3,202 | 3,583 | 3,090 | 56.3% | 59.6% | 66.7% | 57.5%
Sweden | 9,416 | 5,411 | 6,360 | 6,871 | 5,883 | 57.5% | 67.6% | 73.0% | 62.5%
United Kingdom | 63,023 | 13,791 | 22,868 | 29,910 | 18,152 | 21.9% | 36.3% | 47.5% | 28.8%
Norway | 4,920 | 2,672 | 3,016 | 3,257 | 2,562 | 54.3% | 61.3% | 66.2% | 52.1%


exhibit different levels of tourism demand, both in absolute and in relative terms. The causes of these differences in tourism propensity should be investigated more deeply; they can be related to economic and sociocultural differences, and to country-specific customs and habits in tourist behavior, which have been only partially discussed in the academic literature [1, 2]. Second, the way in which tourism demand by residents is distributed over the year also differs greatly from one country to another, with different seasonal variations in terms of both pattern and amplitude [4]. 
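
The percentages in Tab. 1 are consistent with reading the Net Travel Propensity Index as the number of resident travelers in a quarter divided by the resident population. The following minimal Python sketch assumes that definition (it is inferred from the table, not stated in the text) and uses a hypothetical function name; it recomputes the German row.

```python
def net_travel_propensity(travelers_thousands, population_thousands):
    """Share (in %) of the resident population travelling in the quarter (assumed definition)."""
    return 100.0 * travelers_thousands / population_thousands


# Germany, 2011 (values in thousands, from Tab. 1)
population = 81_752
travelers_by_quarter = {"Q1": 32_897, "Q2": 42_750, "Q3": 47_208, "Q4": 39_569}

for quarter, travelers in travelers_by_quarter.items():
    print(quarter, round(net_travel_propensity(travelers, population), 1))
# Q1 40.2
# Q2 52.3
# Q3 57.7
# Q4 48.4
```

Because the index is a share of the population, it makes countries of very different size directly comparable, which is the point of the comparison drawn in this section.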
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Fig.1 Quarterly number of trips made by Residents in top five travel generating European coun- 
tries, actual and moving averages series (data in thousands), 2003-2011. 
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3 Tourism demand of Italian residents from Istat ‘Trips and 
holidays’ survey 


Until 2014, Istat annually presented estimates of the main aggregates of tourism demand in Italy, based on the CATI survey ‘Trips and Holidays’, which was conducted on a quarterly basis from 1997 until 2013. Starting from 2014, a new survey on ‘Consumption of Italian Families’ was introduced, which replaces and updates the ‘Trips and Holidays’ survey. Tab. 2 reports annual data on trips (by purpose and destination) made by Italian residents from 2005 to 2013, together with the number of nights spent, the average length of trips, and the gross travel propensity index. These data highlight the decrease in the number of trips and nights registered in recent years: from about 123 million trips in 2008 to less than 65 million in 2013, a loss of about 60 million trips (-48.6%). In terms of nights, too, a considerable decrease can be observed: from more than 706 million nights in 2008 to about 417 million in 2013, a loss of about 289 million nights (-41.0%). In other words, while in 2008 there were about 210 trips per 100 residents, in 2013 there were only 106. This preliminary analysis shows the strong effect of the economic crisis on tourism, a relatively under-investigated phenomenon. 
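
The figures quoted in this paragraph can be checked with a short Python sketch. It assumes that the gross travel propensity index is the total number of trips divided by the resident population, which matches the values in Tab. 2; the function names are hypothetical.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return 100.0 * (end - start) / start


def gross_travel_propensity(trips_thousands, population_thousands):
    """Trips per resident (assumed definition: total trips / population)."""
    return trips_thousands / population_thousands


# Values in thousands, from Tab. 2
trips_2008, trips_2013 = 122_938, 63_154
nights_2008, nights_2013 = 706_650, 417_126
population_2008, population_2013 = 58_653, 59_685

print(round(percent_change(trips_2008, trips_2013), 1))                  # -48.6
print(round(percent_change(nights_2008, nights_2013), 1))                # -41.0
print(round(gross_travel_propensity(trips_2008, population_2008), 2))    # 2.1
print(round(gross_travel_propensity(trips_2013, population_2013), 2))    # 1.06
```

The two propensity values correspond to the "210 trips per 100 residents" and "106 trips per 100 residents" mentioned above.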

In order to isolate the seasonal component from other potential sources of variability in the series, several methods could be used. For our application, we implemented the TRAMO-SEATS procedure, which allows seasonal factors to be derived. A summary of the SARIMA models estimated for each series is reported in Tab. 3. Additive seasonal factors were produced for the propensity index series related to both holiday and business trips in Italy, whereas multiplicative seasonal factors were derived for the series related to holiday and business trips abroad. 

Once seasonal factors are derived, the cycle plot is a useful tool to synthesize the 
seasonal behavior of the series over all the considered periods. The series of seasonal 
factors related to travel propensity in Italy for holiday purposes seems to be one that 
presents a very high degree of stability of pattern of seasonality. Both the series 
related to holiday trips, in Italy and abroad, present almost the same pattern with a 
peak in August, and values of seasonal factors above the trend-cycle component only 
in summer months (from June to September). Several measures can be used in order 
to summarize the amplitude of seasonal fluctuations, some of which are reported in 
Tab. 4. All the amplitude measures indicate a relative stability of seasonal amplitude 
with a slight decrease of inequality in the last considered years. 

Tab. 4 reveals that the overall level of seasonality in each series is relatively stable 
during all the years considered, with a slight decrease in seasonal amplitude for the 
series related to holiday trips in Italy, holiday trips abroad, and business trips in 
Italy, and a strong irregular behavior for the series related to holiday trips abroad. 
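
As an illustration of how such amplitude measures can be computed from twelve monthly values, the sketch below implements a seasonal range, a coefficient of seasonal variation and a Gini index. The text does not spell out the exact definitions used, so the usual textbook forms are assumed here (range = max minus min, coefficient of variation = population standard deviation over mean, Gini index from the ordered values); the Gini index additionally assumes non-negative values, so it would apply to multiplicative factors or monthly shares rather than to additive factors. All names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def seasonal_range(values):
    """Difference between the highest and the lowest monthly value."""
    return max(values) - min(values)


def coefficient_of_seasonal_variation(values):
    """Population standard deviation of the monthly values divided by their mean."""
    return pstdev(values) / mean(values)


def gini_index(values):
    """Gini concentration index of non-negative monthly values (0 = perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total <= 0 or xs[0] < 0:
        raise ValueError("values must be non-negative with a positive sum")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy example (hypothetical): all demand concentrated in a single month
peaked = [0] * 11 + [120]
print(seasonal_range(peaked))                  # 120
print(round(gini_index(peaked), 4))            # 0.9167
```

Note that none of these three measures depends on the order of the months, so two series with the same monthly values arranged differently receive the same score.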
Table 2 Number of trips, nights and related index of trips characteristics made by Italian residents in Italy and abroad, 2005-2013 

Variable | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013
Data in thousands
Holiday trips in Italy | 77,860 | 78,606 | 80,972 | 90,463 | 82,265 | 71,926 | 59,807 | 54,733 | 46,062
Holiday trips abroad | 14,268 | 15,284 | 16,201 | 16,347 | 16,412 | 15,524 | 12,751 | 13,967 | 11,389
Total trips (all purposes) | 107,100 | 107,895 | 112,369 | 122,938 | 114,099 | 100,040 | 83,417 | 78,703 | 63,154
Nights spent in Italy (holiday purposes) | 493,775 | 534,672 | 480,724 | 508,722 | 497,230 | 467,796 | 393,015 | 357,772 | 300,709
Nights spent abroad (holiday purposes) | 123,003 | 133,119 | 146,267 | 135,374 | 125,351 | 118,250 | 101,757 | 113,828 | 100,946
Total nights (all purposes) | 676,243 | 719,763 | 689,313 | 706,650 | 680,215 | 626,990 | 527,811 | 501,059 | 417,126
Population (1 Jan) | 57,875 | 58,064 | 58,224 | 58,653 | 59,001 | 59,190 | 59,365 | 59,394 | 59,685
Avg. duration of trips in Italy | 6.34 | 6.80 | 5.94 | 5.62 | 6.04 | 6.50 | 6.57 | 6.54 | 6.53
Avg. duration of trips abroad | 8.62 | 8.71 | 9.03 | 8.28 | 7.64 | 7.62 | 7.98 | 8.15 | 8.86
Avg. duration of trips | 6.31 | 6.67 | 6.13 | 5.75 | 5.96 | 6.27 | 6.33 | 6.37 | 6.60
Gross travel propensity index in Italy (holiday purposes) | 1.35 | 1.35 | 1.39 | 1.54 | 1.39 | 1.22 | 1.01 | 0.92 | 0.77
Gross travel propensity index abroad (holiday purposes) | 0.25 | 0.26 | 0.28 | 0.28 | 0.28 | 0.26 | 0.21 | 0.24 | 0.19
Travel propensity index (all purposes) | 1.85 | 1.86 | 1.93 | 2.10 | 1.93 | 1.69 | 1.41 | 1.33 | 1.06


Table 3 Summary of TRAMO-SEATS procedure on monthly travel propensity index series in Italy, by purposes and destination. Years 2002-2013 

 | Holiday trips in Italy | Business trips in Italy | Holiday trips abroad | Business trips abroad
Transformation | none | none | Log-transformation | Log-transformation
SARIMA Model | (0,1,1)(0,1,1) | (1,0,0)(0,1,1) | (0,1,3)(0,1,1) | (1,1,1)(1,0,0)
Parameter estimates | $\theta_1 = -0.7240$ | $\phi_1 = -0.3629$ | $\theta_1 = -0.8648$ | $\phi_1 = -0.4643$
 | $\Theta_{12} = -0.8283$ | $\Theta_{12} = -0.8435$ | $\theta_2 = 0.1459$ | $\theta_1 = -0.9900$
 | | | $\theta_3 = -0.1378$ | $\Phi_{12} = -0.4883$
 | | | $\Theta_{12} = -0.7447$ |
Fig. 2 Cycle plots of seasonal factors for travel propensity index, holiday trips in Italy and abroad, 2002-2013. 
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Table 4 Measures of amplitude derived from seasonal factors of Travel Propensity Index, by pur- 
pose and destination of trips, Italian residents 2002-2013 


Holiday trips in Italy Business trips in Italy Holiday trips abroad Business trips abroad 

Seasonal Range Seasonal Range Coefficient of Gini Coefficient of Gini 
Year Seasonal Variation Index Seasonal Variation Index 
2002 23,787.7 269.9 0.68 0.12 0.21 0.34 
2003 23,792.1 3268.3 0.67 0.11 0.20 0.33 
2004 23,773.5 5262.2 0.67 0.11 0.19 0.33 
2005 23,741.6 ;249.6 0.67 0.16 0.26 0.32 
2006 23,725.9 +233.9 0.66 0.20 0.32 0.32 
2007 23,699.0 32214 0.65 0.19 0.33 0.31 
2008 23,643.4 , 207.0 0.64 0.13 0.20 0.31 
2009 23,584.8 , 187.5 0.64 0.10 0.17 0.30 
2010 23,495.8 , 176.3 0.64 0.15 0.25 0.30 
2011 23,397.0 , 178.0 0.64 0.35 0.61 0.30 
2012 23,335.6 , 181.2 0.63 0.20 0.32 0.30 


2013 20, 708.7 958.8 0.71 0.30 0.38 0.24 


4 Conclusion 


European residents have very different tourism behaviour in terms of overall travel propensity and very different seasonal travel demands. Thanks to microdata obtained from the survey on tourism by Italian residents, a detailed analysis of seasonality has been possible. Despite the considerable decrease in tourism trips following the recent economic crisis, a strong and persistent seasonal behaviour characterizes the pattern of tourism trips in Italy. In social phenomena it is common to observe changes in human behaviours and attitudes but, quite surprisingly, this is not the case here. From the methodological perspective, this work used the Gini index, as well as other indices for synthesizing the seasonal burden, which, as pointed out by De Cantis et al. [4], do not take into account the natural ordering of months. Consequently, a deeper analysis of the causes that determine the persistent seasonal behaviour, as well as critical reflection on the appropriate measurement of seasonal fluctuations, merits further investigation from a variety of perspectives. 
Optimal Ethical Balance for Phase III Trials 
Planning 


Bilancio Etico Ottimale nella Pianificazione della Fase 
III delle Prove Cliniche 


Lucio De Capitani and Daniele De Martini 


Abstract An ethical evaluation is mandatory for every clinical trial, and Ethics Committees have been established for that purpose. The distinction between individual and collective ethics was introduced in a seminal work by Lellouch and Schwartz in 1971, where individual ethics regards concerns related to the patients enrolled in the trial, and collective ethics those of the patients not enrolled who would benefit from a positive trial result. In this paper, a metrization of individual and collective ethics is proposed in order to evaluate their balance in a confirmatory clinical trial. The evaluation of the ethical balance between these two aspects of ethics can be performed before the trial starts, in order to address sample size determination. The metrization is based, among other parameters, on the drug effect size, on the quality of life of patients under therapy or placebo, and on that induced by adverse reactions. Some numerical examples show that the optimal ethical balance can be provided by sample sizes far from those computed by adopting the usual paradigm based on a prefixed power of 80-90%. 

Abstract Ethical evaluation in clinical trials is absolutely necessary, and indeed every clinical study is submitted to an appropriate ethics committee. The distinction between individual and collective ethics was introduced in a pioneering work by Lellouch and Schwartz in 1971, in which the possible harms suffered by the patients involved in the study are assessed by individual ethics, while collective ethics considers those of the patients not involved, who would benefit from a positive result of the clinical study. In this work, a quantification of individual and collective ethics is proposed in order to evaluate the ethical balance in a confirmatory clinical study. The evaluation of the balance between these two aspects of ethics can be carried out before the start of the clinical study in order to compute the optimal sample size. The proposed quantification of ethics depends on the size of the drug effect, on the quality of life of patients under therapy or placebo, and on that induced by possible adverse effects. Numerical examples show that the optimal ethical balance can be reached at sample sizes far from those computed by adopting the usual paradigm, in which the sample size is chosen so as to ensure a test power of 80-90%. 


Key words: Individual Ethics, Collective Ethics, Global Ethics, Assurance 


1 Introduction 


It is common practice to fix the power of phase III clinical trials in advance at 80-90%. These power settings are suggested by the most relevant books on clinical trial methodology (see, for example, [8] or [2]). Unfortunately, the practical choice of power is seldom motivated with regard to its clinical, and therefore ethical, impact. 

In fact, on the one hand, 80% looks high enough to guarantee, with high probability, that if the new treatment is effective the trial will succeed and the drug will become available to the ill population. On the other hand, the power is often set no higher than 90% in order to minimize, for the enrolled sample of patients, the risks and waste due to potential harm and/or lack of efficacy of the new treatment. In other words, the power threshold of 80-90% seems to meet both the need to affect medical practice, if the drug is effective, and the need to preserve the safety of enrolled patients. 

These two concerns recall the concepts of individual and collective ethics, introduced by [7]. Originally, collective ethics (CE) concerned maximizing total group benefit, and individual ethics (IE) concerned maximizing the benefit of each person to be treated. 

However, it is not clear when to adopt 80% and when 90%, or why. FDA and [12] encourage setting the power at 90%, but not for explicit ethical reasons: they argue for this power in order to decrease the rate of unsuccessful trials. In fact, the literature gives no precise indications about which power threshold should be adopted in different ethical situations. 

In the seminal paper [7] the concepts of IE and CE were introduced, together with some mathematical formulations of them under both fixed and sequential designs. [6] provide a critical history of IE and CE, and remark that “very little follow-up research in the lineage of the 1971 paper considers mathematical models”. 

The aim of this work is to provide a model to quantify the ethical balance in fixed designs, which are the most widely adopted in phase II and phase III trials (see www.clinicaltrials.gov). Our perspective is in agreement with the view on the ethics of the trial proposed in [11], which “is dictated by the type of evidence sought and by balancing various costs of aggregate harm and benefit” (see [6]). 




In [4] it is recalled that point 2 of the Nuremberg Code offers strong support for the existing connection between power and ethics, and the authors add: “the underlying principle is that for any given outlay of human risk or resources, there is an obligation to maximize the power and efficiency of the experimental design”. Here, we invert the point of view, by modeling ethics as a function of the sample size, and so of the statistical power. Then, we suggest adopting the sample size that maximizes experimental ethics. 


2 Theoretical framework 


A two-arm parallel design with balanced sampling is considered for the phase III trial. A sample of size $n$ is collected for each arm (i.e. new drug and standard treatment/placebo). The true, and unknown, proportions of healing are $p_t$ and $p_c$. The statistical hypotheses are $H_0 : p_t = p_c$ vs $H_1 : p_t > p_c$, and $\alpha$ is the type I error probability. We assume that the proportions represent the responder rate under the two arms. 

Given that $\hat{p}_{t,n}$ and $\hat{p}_{c,n}$ are the sample proportions, the test statistic is $T_n = \sqrt{n/2}\,(\hat{p}_{t,n} - \hat{p}_{c,n}) / \sqrt{(\hat{p}_{t,n}(1 - \hat{p}_{t,n}) + \hat{p}_{c,n}(1 - \hat{p}_{c,n}))/2}$. The power function $\pi(n) = P(T_n > z_{1-\alpha})$ is approximated by $\Phi(ES\sqrt{n/2} - z_{1-\alpha})$, where $ES$ is the standardized effect size: 


$$ES = (p_t - p_c) / \sqrt{(p_t(1 - p_t) + p_c(1 - p_c))/2}.$$ 
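
The power approximation above can be turned into a standard sample size calculation. The sketch below is a minimal Python illustration, not the authors' code: it assumes that $\Phi$ is the standard normal distribution function and $z_{1-\alpha}$ the corresponding quantile, and it solves $\Phi(ES\sqrt{n/2} - z_{1-\alpha}) \geq 1 - \beta$ for the smallest integer $n$ per group. Function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def effect_size(p_t, p_c):
    """Standardized effect size ES for two proportions (unpooled variance)."""
    return (p_t - p_c) / sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)


def power(n, p_t, p_c, alpha):
    """Approximate power of the one-sided test with n patients per arm."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(effect_size(p_t, p_c) * sqrt(n / 2) - z_alpha)


def sample_size(target_power, p_t, p_c, alpha):
    """Smallest n per arm whose approximate power reaches target_power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(target_power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size(p_t, p_c)) ** 2)


# p_t = 0.5, p_c = 0.1, alpha = 2.5%, as in the first numerical example
print(sample_size(0.80, 0.5, 0.1, 0.025))  # 17
print(sample_size(0.90, 0.5, 0.1, 0.025))  # 23
```

With $p_t = 0.3$ the same function returns 59 and 79, the standard per-group sizes quoted for the second numerical example.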


3 Modeling ethics 


Ethics is the sum of individual and collective ethical contributions. Usually, individual ethics concerns the sample of patients involved in the experiment, whereas collective ethics considers the remaining population. 

Basically, ethics is computed through benefit/risk indicators times the duration of the periods of interest. Quality of Life (QoL) measures are adopted together with the probabilities of response and of harm, which are related to the “Number Needed to Treat” (NNT) and the “Number Needed to Harm” (NNH), all being classical benefit/risk indexes (see, for example, [5]). In particular, the ethical contribution of a group is given by the group size times the quality of life indicator times the duration of the period over which that QoL persists. Since the new treatment is available to the population only if the trial succeeds, collective ethics is also multiplied by the power of the experiment. Thus, in general, ethics is: 

$$E = \text{group size} \times \text{probability} \times \text{quality of life} \times \text{duration}$$ 

Individual ethics ($IE$) concerns the ethics of the population sample enrolled in the trial, and considers the placebo group just during the trial. $IE$ depends on: the size of each sample ($n$), the responder rate under the new therapy ($p_t$), the harm probability under the new therapy ($hp_t$), the responder rate under the control treatment ($p_c$), the quality of life during the disease and before treatment ($QoL_d$), the benefit (quality of life) after the new treatment ($QoL_t$), the harm (risk) during the new treatment ($QoL_h$), the benefit after the control treatment ($QoL_c$), life expectancy ($D_l$), the duration of the therapy ($D_{th}$) and the accrual rate ($A_r$, which is assumed to be uniform during enrollment). The duration of the phase III trial is: $D_{P3} = 2n/A_r + D_{th}$. 

Under the new treatment, the ethical contribution of the (possible) QoL improvement is the sum of the contributions of responders and non-responders: 


$$IE_T(n) = n \times \big( p_t \times (QoL_t \times (D_l - (D_{P3}(n) - D_{th})/2) + QoL_d \times (D_{P3}(n) - D_{th})/2) + (1 - p_t) \times QoL_d \times D_l \big)$$ 


Note that the duration of the benefit for responders is given by life expectancy minus the average of the time elapsed from the beginning of the trial and the end of the therapy. For non-responders, the quality of life remains that of the disease for the whole of life. 

The ethical contribution of harm due to the new drug is: 


$$IE_{TH}(n) = n \times hp_t \times QoL_h \times D_{th}.$$ 


Under the placebo control there is no harm. The (possible) QoL improvement of responders and non-responders is evaluated just during the trial (at the end of phase III this group could be treated with the new therapy if the trial succeeds), which gives: 


$$IE_C(n) = n \times \big( p_c \times (QoL_c \times D_{th} + QoL_d \times (D_{P3}(n) - D_{th})) + (1 - p_c) \times QoL_d \times D_{P3}(n) \big).$$ 


Finally, $IE$ is: $IE(n) = IE_T(n) + IE_{TH}(n) + IE_C(n)$. 

Collective ethics ($CE$) concerns the ethics of the population not involved in the trial, and of the group enrolled in the trial under the control treatment once the trial has been completed. Besides the quantities already introduced to define $IE$, $CE$ depends on: the population size ($N$), the incidence of the illness ($Prev_{ill}$), and the power of the experiment $\pi(n)$. The size of the ill population is $N_{ill} = N \times Prev_{ill} - n$, that is, the ill population minus the group that tested the new treatment in the trial. 

First, the “during trial” ethical balance of the population not involved in the experiment is: 

$$CE_{DT}(n) = (N \times Prev_{ill} - 2n) \times QoL_d \times D_{P3}.$$ 


When the trial succeeds, the ethical contribution related to the benefit of the population due to treatment, for responders and non-responders, is: 


$$CE_{ST}(n) = N_{ill} \times (p_t \times QoL_t + (1 - p_t) \times QoL_d) \times (D_l - D_{P3}(n) - D_{th}) \times \pi(n).$$ 


Note that the duration of the quality of life (benefit or not) is given by life expectancy minus the duration of the trial minus that of the therapy. In other words, it is assumed that the ill population adopts the new treatment as soon as it is available, that is, just after the end of the successful trial. 
The ethical contribution of harm due to the new drug in the ill population also accounts for the power of the experiment, resulting in: 


$$CE_{TH}(n) = N_{ill} \times hp_t \times QoL_h \times D_{th} \times \pi(n).$$ 


If the trial is unsuccessful, the harm is due to the loss of benefit, and the quality of life of the ill population remains the same: 


$$CE_{UT}(n) = N_{ill} \times QoL_d \times (D_l - D_{P3}(n)) \times (1 - \pi(n)).$$ 

The total collective ethical contribution is: 

$$CE(n) = CE_{DT}(n) + CE_{ST}(n) + CE_{TH}(n) + CE_{UT}(n).$$ 


The global ethical contribution of the experiment is given by the sum of individual and collective ethics: 


$$GE(n) = IE(n) + CE(n). \qquad (1)$$ 


It is of interest to compute the sample size providing the best ethical balance, and then to evaluate the power this optimal sample size provides, which may fall outside the classical range of power adopted for planning phase III trials, i.e. [80%, 90%]. 
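
To make the structure of model (1) concrete, the following Python sketch assembles the individual and collective contributions and searches for the sample size that maximizes their sum over a grid of integers. It is an illustration under stated assumptions, not the authors' implementation: the variable names are hypothetical, the power is computed with the normal approximation of Section 2, and the parameter values are those of the first numerical example in Section 4. The optimum it returns depends on that power approximation and on the reading of the formulas adopted here, so it should be treated as illustrative rather than as a reproduction of the reported results.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()


@dataclass(frozen=True)
class TrialParams:
    alpha: float   # type I error probability
    p_t: float     # responder rate, new therapy
    p_c: float     # responder rate, control
    hp_t: float    # harm probability, new therapy
    qol_d: float   # QoL during the disease
    qol_t: float   # QoL after the new treatment (responders)
    qol_h: float   # QoL under harm from the new treatment
    qol_c: float   # QoL after the control treatment (responders)
    d_l: float     # life expectancy
    d_th: float    # duration of the therapy
    a_r: float     # accrual rate
    pop: float     # population size N
    prev: float    # incidence of the illness


def power(n, q):
    es = (q.p_t - q.p_c) / sqrt((q.p_t * (1 - q.p_t) + q.p_c * (1 - q.p_c)) / 2)
    return NORMAL.cdf(es * sqrt(n / 2) - NORMAL.inv_cdf(1 - q.alpha))


def trial_duration(n, q):
    return 2 * n / q.a_r + q.d_th


def individual_ethics(n, q):
    d_p3 = trial_duration(n, q)
    wait = (d_p3 - q.d_th) / 2  # average time spent waiting during enrollment
    ie_t = n * (q.p_t * (q.qol_t * (q.d_l - wait) + q.qol_d * wait)
                + (1 - q.p_t) * q.qol_d * q.d_l)
    ie_th = n * q.hp_t * q.qol_h * q.d_th
    ie_c = n * (q.p_c * (q.qol_c * q.d_th + q.qol_d * (d_p3 - q.d_th))
                + (1 - q.p_c) * q.qol_d * d_p3)
    return ie_t + ie_th + ie_c


def collective_ethics(n, q):
    d_p3 = trial_duration(n, q)
    pw = power(n, q)
    ill = q.pop * q.prev
    n_ill = ill - n
    ce_dt = (ill - 2 * n) * q.qol_d * d_p3
    ce_st = (n_ill * (q.p_t * q.qol_t + (1 - q.p_t) * q.qol_d)
             * (q.d_l - d_p3 - q.d_th) * pw)
    ce_th = n_ill * q.hp_t * q.qol_h * q.d_th * pw
    ce_ut = n_ill * q.qol_d * (q.d_l - d_p3) * (1 - pw)
    return ce_dt + ce_st + ce_th + ce_ut


def global_ethics(n, q):
    return individual_ethics(n, q) + collective_ethics(n, q)


def optimal_sample_size(q, n_max=500):
    """Grid search for the per-group n that maximizes GE(n)."""
    return max(range(2, n_max + 1), key=lambda n: global_ethics(n, q))


first_example = TrialParams(alpha=0.025, p_t=0.5, p_c=0.1, hp_t=0.05,
                            qol_d=-2, qol_t=5, qol_h=-5, qol_c=0.5,
                            d_l=20, d_th=0.5, a_r=200,
                            pop=1_000_000, prev=0.10)
n_opt = optimal_sample_size(first_example)
print(n_opt, round(power(n_opt, first_example), 4))
```

A grid search is used because $GE(n)$ is only defined for integer sample sizes and is cheap to evaluate; the upper bound `n_max` is an arbitrary choice for this sketch.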


4 Examples 


Two numerical examples are reported here, based on the ethical model in (1). 

Let us consider first a situation where the new drug works well, and where side effects are low. We expect that the power at the phase III sample size giving the best ethical balance is high. 

We set the parameters as follows: $\alpha = 2.5\%$, $p_t = 0.5$, $p_c = 0.1$, $Prev_{ill} = 10\%$, $N = 1M$, $hp_t = 0.05$, $QoL_d = -2$, $QoL_t = 5$, $QoL_h = -5$, $QoL_c = 0.5$, $D_l = 20$, $D_{th} = 0.5$, $A_r = 200$. In this situation, if only $\alpha$, $p_t$ and $p_c$ were considered, standard sample size computation would give group samples of size 17 to achieve a power of 80%, and of size 23 for a power of 90%. However, the sample size giving the optimal Ethical Balance, that is, the one maximizing $GE(n)$, is $n_{opt} = \arg\max GE(n) = 55$ per group. The resulting power is $\pi(55) = 0.9956$, meaning that the optimal power is considerably higher than those usually adopted, viz. 80-90%. 

Now, consider a situation where the effect of the new drug is only moderate and
the side effects are considerable. Some of the above parameters are modified
as follows: p_1 = 0.3, Prev; = 0.1%, hp; = 0.5, QoL; = 3, QoL, = -10, A, = 50.
In this second situation, standard sample sizes per group would be 59 and 79,
for a prefixed power of 80% and 90%, respectively. The optimal Ethical Balance is
obtained here with samples of size argmax(GE(n)) = n_opt = 50. The power function
has now changed, since p_1 = 0.3. Consequently, the optimal power under the ethical
perspective is π(50) = 0.7054. This means that when the effect size is low and there
are considerable side effects, the optimal power can be considerably lower than 80-90%.

Often, the information on the parameters involved in the Ethical Balance is weak.
To account for the possible deviation between the prefixed values of the parameters
and their true values, a statistical distribution on the parameters is usually introduced:
this technique is called assurance (see [9]). We performed a sensitivity analysis of
model (1) with assurance and found that, in this case too, the optimal power can be
considerably lower than 80-90%. These results are not reported here for the sake of brevity.
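
The standard group sizes quoted in the two examples can be reproduced with the usual
normal-approximation formula for comparing two proportions. The formula is not stated
in the text, so the sketch below rests on assumptions: a one-sided test at level α and
unpooled variances. The function name `group_size` is hypothetical.

```python
import math
from statistics import NormalDist


def group_size(alpha, power, p1, p2):
    """Per-group sample size for comparing two proportions.

    Assumes a one-sided test at level alpha and the unpooled normal
    approximation (an assumption, not a formula given in the text).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)


# First example (p_1 = 0.5, p_2 = 0.1) and second example (p_1 = 0.3, p_2 = 0.1)
for p1 in (0.5, 0.3):
    sizes = [group_size(0.025, power, p1, 0.1) for power in (0.80, 0.90)]
    print(p1, sizes)
```

Under these assumptions the function returns 17 and 23 for the first example and 59 and
79 for the second, matching the sizes quoted above. The optimal sizes n_opt and the
powers π(55) and π(50) depend on the ethical model and its power function, and are not
recomputed here.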


5 Conclusions 


We have shown, through the model of individual and collective ethics we introduced
and a couple of appropriate examples, that fixing the power within the
range 80-90% can lead to poor choices in terms of the impact a new drug might
have on the ill population.

To conclude, we would like to merge ethical models, like the one introduced here,
and economic models, such as those developed in [10], [1], and [3], since we
believe that even profit and cost have an ethical impact on our society.

Sampling schemes using scanner data for the
consumer price index
Schemi di campionamento per la stima dell’indice dei
prezzi al consumo usando gli scanner data


Claudia De Vitiis, Alessio Guandalini, Francesca Inglese and Marco D. Terribili 


Abstract The Italian National Institute of Statistics (ISTAT) is carrying out a
redesign of the Consumer Price Survey (CPS). The availability of Scanner Data (SD)
from modern retail distribution, provided to ISTAT by Nielsen for a large number of
stores selling food and grocery products, is the starting point of this transformation.
Indeed, SD represent a major opportunity for improving the computation of the
Consumer Price Index (CPI). This work studies the properties of alternative
aggregation formulas of the elementary price index under different sampling schemes
implemented on SD. Bias and efficiency of the estimated indices are evaluated in a
Monte Carlo simulation. Finally, a fixed and a dynamic approach to the compilation
of the elementary price indices are compared.
Abstract La disponibilità di dati scanner (SD) provenienti dalla grande 
distribuzione, che l’ISTAT acquisisce dalla Nielsen per un cospicuo numero di punti 
vendita (prodotti alimentari e per la casa), costituisce il punto di partenza per il 
ridisegno dell’indagine sui Prezzi al Consumo. Infatti, gli SD rappresentano una 
grande opportunità per l’introduzione di miglioramenti nel calcolo dell’indice dei 
prezzi al consumo (CPI). Questo lavoro si propone di studiare le proprietà di diversi 
schemi di campionamento implementati sugli SD con differenti formule di calcolo 
dell’indice elementare dei prezzi, valutando distorsione ed efficienza degli indici 
con una simulazione Monte Carlo. Infine, si presenta un confronto tra l’approccio 
fisso e dinamico per il calcolo dell'indice. 

1 Introduction


The Italian National Institute of Statistics (ISTAT) is carrying out a redesign of the 
Consumer Price Survey (CPS). The main aim of the project is to modernise the 
survey, improving and unburdening the data collection phase, together with the 
progressive introduction of more rigorous sampling procedures, probabilistic where 
possible, for the selection of outlets and products for the sectors where this is 
feasible (De Vitiis et al., 2015; Bernardini et al., 2016). 

The availability of Scanner Data (SD) from modern retail distribution, provided
through an agreement with Nielsen, is the starting point of this transformation.
SD represent a major opportunity for improvements in both data collection and
sampling.

Through a contract with Nielsen and an agreement with the six main retail chains
operating in Italy, ISTAT has been receiving SD for the food and grocery markets
since the end of 2014 and processing them in order to experiment with the
computation of the consumer price index (CPI). Scanner data files contain
elementary information on single EAN codes¹ (European Article Number,
GTIN) for specific outlets, consisting of turnover and quantities sold during a week.
This information does not provide the “shelf price” of the product identified by
the EAN code and outlet (reference or series), but allows a unit value, or
average weekly price, to be defined. Because of operational constraints of the
production process, the observable weeks are restricted: only
the relevant weeks are considered, defined as the first three full weeks (composed of
seven days) in each month. Furthermore, SD usually do not include information
about discounts or special sales.

For an accurate computation of the price index over time, the use of high-frequency
data such as SD requires considerable effort in both the data collection and
estimation phases. Completeness and correctness of the data are two
pillars of a correct use of SD (Vermeulen and Herren, 2006; Van der Grient et al.,
2010); formal and quality checks on the data flow must be implemented. In the
estimation phase, some important drawbacks associated with SD, such as the high
attrition rate of products, temporarily missing products, the entry of new products
and the volatility of prices and quantities due mainly to sales, need to be addressed
from both a theoretical and a practical point of view.

An important issue, outside the scope of this paper but crucial for the ISTAT CPI,
is the need to combine estimates derived from scanner data with the estimates
that will continue to be produced by the current field survey for traditional
retail distribution.

The aim of this paper is to present the SD experimental framework in which,
first, probability and nonprobability selection schemes of series and, then,
different probability sampling designs are compared. In particular, some important
results on the properties of the price index at the elementary aggregate level,
calculated according to different formulas and various probability sampling designs,
are presented. A further experiment is also summarized to highlight the
differences between a fixed and a dynamic population approach in the construction
of the price indices.

1 Nielsen also provided the dictionary mapping EAN codes to the GS1-ECR-Indicod product classification, while
ISTAT ensures internally the translation from ECR to COICOP, the classification of products used for the CPI.

The paper is organized as follows: section 2 describes the context and the
methodological approach of the experiments used to compare different sampling
designs on SD; section 3 shows the most important results regarding the accuracy of
the price index estimates; section 4 presents some conclusions and future
developments.


2 Context and methods for sampling scanner data series 


The experiments carried out so far on the scanner data aimed at evaluating the
properties of the weighted and unweighted elementary price indices under different
selection schemes of series (EAN and outlet codes). A series is a reference for
which prices are observed during a certain period. In this phase of the experiments,
the implications of the life cycle of series, seasonality and missing data have not
been taken into account, and a simplification has been used: only permanent series¹
are considered as the universe for sampling and price index evaluation. A sample of
series is selected at the beginning of the reference period and followed throughout
the year, without considering either new entries or discontinuities.

For each selection scheme, starting from the monthly price ratios with fixed base
(December 2013) available for 2014, the elementary price indices are calculated
using three classic aggregation formulas: Jevons (unweighted), Fisher (ideal) and
Lowe (weights from quantities of the previous year). The choice of these indices was
made on the basis of theoretical and empirical considerations: the Fisher ideal
index, in particular, is preferred by economic theory because it uses quantities
from different periods and allows for substitution effects.
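
As an illustration, the three aggregation formulas can be sketched in their textbook
(population) form. The function names and the toy prices and quantities are
hypothetical; `p0` and `pt` are prices in the base and current month, `q0` and `qt` the
corresponding quantities, and `qb` the quantities of the weight reference period.

```python
import math


def jevons(p0, pt):
    """Unweighted geometric mean of the price relatives pt / p0."""
    logs = [math.log(b / a) for a, b in zip(p0, pt)]
    return math.exp(sum(logs) / len(logs))


def lowe(p0, pt, qb):
    """Cost of a fixed basket qb at current prices over its cost at base prices."""
    return sum(p * q for p, q in zip(pt, qb)) / sum(p * q for p, q in zip(p0, qb))


def fisher(p0, pt, q0, qt):
    """Geometric mean of the Laspeyres (q0) and Paasche (qt) indices."""
    laspeyres = lowe(p0, pt, q0)
    paasche = lowe(p0, pt, qt)
    return math.sqrt(laspeyres * paasche)


# Toy data for two series (illustrative only)
p0, pt = [1.0, 2.0], [1.1, 1.8]
q0, qt, qb = [10, 5], [8, 6], [9, 5]
print(jevons(p0, pt), lowe(p0, pt, qb), fisher(p0, pt, q0, qt))
```

Only the Fisher index uses quantities from both the base and the current period, which
is why it can reflect substitution between products.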

The experimental study was developed in two phases: first, probability and
nonprobability selection schemes of series were compared; then, several
sampling designs, characterized by different criteria of sample allocation,
both for outlets and elementary items (EANs), and by different selection methods of
the sampling units, were considered. The comparison among the alternative selection
schemes is made, for each price index, taking the corresponding true value of the
index computed on the whole universe as a benchmark. Index performance is
evaluated in terms of bias for all selection schemes. For probability selection
schemes, the accuracy (bias and sampling variance) of the price indices was
studied with a Monte Carlo simulation: 500 samples are selected according to the
different sampling designs, and the variability and bias of the indices are computed
on the estimates obtained in the replicated samples. The sample selection and the
weighting of the price indices are based on the total annual turnover of 2013. The
analyses were conducted on SD for six retail chains (Conad, Coop, Esselunga, Auchan,
Carrefour, Selex) available in 2014 for the province of Turin, considering some
consumption segments.

1 Permanent series are those references with non-null turnover for at least one relevant week (the first three full weeks)
in each month of the considered year, starting from December of the previous year.
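
The Monte Carlo evaluation described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' code: the universe is
synthetic, the design is simple random sampling without replacement, only the Jevons
index is computed, and all names are hypothetical.

```python
import math
import random
import statistics


def jevons(ratios):
    """Geometric mean of price ratios (unweighted elementary index)."""
    return math.exp(statistics.fmean([math.log(r) for r in ratios]))


def monte_carlo_accuracy(universe, n, replications=500, seed=2014):
    """Bias and sampling variance of the index over replicated SRS samples."""
    rng = random.Random(seed)
    true_value = jevons(universe)  # benchmark computed on the whole universe
    estimates = [jevons(rng.sample(universe, n)) for _ in range(replications)]
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - true_value
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    return {
        "true": true_value,
        "bias": bias,
        "variance": variance,
        "rmse": math.sqrt(bias ** 2 + variance),
    }


# Synthetic universe of price ratios (illustrative only, not ISTAT data)
generator = random.Random(1)
universe = [math.exp(generator.gauss(0.0, 0.05)) for _ in range(2000)]
result = monte_carlo_accuracy(universe, n=100)
print(result)
```

Replacing `rng.sample` with a draw from another design, and `jevons` with a
design-weighted estimator, gives the same comparison for other sampling schemes.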

Sampling designs - In the first phase, nonprobability sampling is carried out by
selecting series on the basis of cut-off thresholds of turnover covered in the previous
year, 2013: two samples are formed with all the series covering, respectively, 60
and 80 percent of the total turnover in each of the considered consumption segments
(coffee, pasta, mineral water) in the selected outlets. Moreover, considering the
currently used fixed basket approach, a second selection scheme is defined by selecting
the most sold EANs for each representative product in the selected outlets.
The nonprobability selection schemes are compared with a two-stage probability sampling
design, where primary stage units (PSU) and secondary stage units (SSU) are
outlets and EANs, respectively. The size of the outlet sample is fixed at
30 out of the 121 outlets available in SD. The sample size for SSU is fixed by
a sampling rate of 5 percent of the number of EANs in each consumption segment in
the sampled outlets. Outlets are stratified by chain and outlet type (hypermarket and
supermarket). In each stratum, the sample is allocated proportionally to
turnover. Outlets are selected in each stratum by simple random
sampling (SRS), while EANs are selected with probability proportional to size
(PPS), in terms of total turnover of the previous year, by adopting Sampford sampling
(Sampford, 1967).
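
To make the PPS step concrete, the sketch below selects units with inclusion
probabilities proportional to turnover. It uses randomised systematic PPS sampling, a
simpler stand-in for the Sampford method used in the experiments; the function name and
the turnover figures are hypothetical, and it assumes that no unit is so large that its
inclusion probability would exceed 1.

```python
import random


def pps_systematic(sizes, n, seed=0):
    """Select n units with inclusion probability n * size / total.

    Randomised systematic PPS sampling (not Sampford's method).
    Returns the selected indices and the inclusion probabilities.
    """
    total = sum(sizes)
    probs = [n * s / total for s in sizes]
    if max(probs) > 1:
        raise ValueError("a unit is too large for PPS without replacement")
    rng = random.Random(seed)
    order = list(range(len(sizes)))
    rng.shuffle(order)          # random ordering of the units
    threshold = rng.random()    # random start in [0, 1)
    cumulative = 0.0
    selected = []
    for i in order:
        cumulative += probs[i]
        # A unit is taken when the next threshold falls inside its interval
        if threshold < cumulative and len(selected) < n:
            selected.append(i)
            threshold += 1.0
    return selected, probs


turnover = [120, 80, 60, 40, 30, 25, 20, 15, 10]  # hypothetical 2013 turnover
sample, probs = pps_systematic(turnover, n=3)
weights = [1 / probs[i] for i in sample]          # design weights
estimated_total = sum(w * turnover[i] for w, i in zip(weights, sample))
print(sample, estimated_total)
```

Because the size measure is the turnover itself, the weighted total of turnover equals
the population total for every possible sample, which is a convenient check of the
inclusion probabilities.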

In the second phase of the experiments the following sampling designs were
compared: 1) one-stage stratified sample of EANs; 2) cluster sample of outlets;
3) two-stage sampling with stratification of PSU (outlets) and SSU (EANs). For each
sampling design the size of the final sample of EANs is fixed, on average, at
7,400 in order to compare the different sampling strategies at equal computational
effort. Moreover, different criteria of sample allocation, both for outlets and EANs,
and different selection methods of the units were considered.

The first sampling design is carried out by stratifying the EANs by market (ECR
group) in each consumption segment (coffee, pasta, mineral water, olive
oil, spumante and ice cream). The sample size is allocated among the strata through
the Neyman formula, taking into account the variability of price relatives in the
markets observed in the reference year 2013. Two selection schemes were
considered, SRS and PPS.
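
For reference, Neyman allocation assigns to stratum h a sample size proportional to
N_h * S_h, where N_h is the stratum size and S_h the standard deviation of the study
variable in the stratum. The sketch below is illustrative: the function name and the
figures are hypothetical, rounding uses the largest-remainder rule, and the allocation
is not capped at the stratum size.

```python
import math


def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n units to strata proportionally to N_h * S_h."""
    products = [N * S for N, S in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    exact = [n * p / total for p in products]
    alloc = [math.floor(x) for x in exact]
    # Largest-remainder rounding so that the allocations sum to n
    by_remainder = sorted(
        range(len(exact)), key=lambda h: exact[h] - alloc[h], reverse=True
    )
    for h in by_remainder[: n - sum(alloc)]:
        alloc[h] += 1
    return alloc


# Three hypothetical markets: number of EANs and SD of price relatives
print(neyman_allocation([100, 200, 700], [10, 5, 1], n=100))
```

Strata with more variable price relatives receive a larger share of the sample than
under proportional allocation.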

In the second design, cluster sampling, a sample of outlets (14 out of 121)
is selected. Outlets are stratified by chain and type. Two different
allocations of outlets to the strata are tested: proportional to the stratum turnover
and optimal allocation (Neyman). Outlets are selected with both SRS and PPS methods.
All the EANs in the selected outlets are included in the sample.

Finally, the two-stage sampling design is characterized by a stratification of both
PSU and SSU. The stratifications adopted for the PSU and the SSU are the same as in
the two schemes described above. The size of the outlet sample is fixed at
30 out of 121 outlets. For both outlets and EANs, sample allocation to the
strata is proportional to the stratum turnover. PSU are selected with a PPS method,
while SSU are selected with both SRS and PPS methods.

Unbiased estimators - The parameters of interest are the monthly Jevons, Fisher and
Lowe indices. The Jevons index is an unweighted price index that uses price
information only (it assumes that expenditure shares remain constant), while the
Fisher and Lowe indices also use quantity information, taking turnover shares at
different time periods as weights (Gabor and Vermeulen, 2014). Indicating by the subscript t
the current month (12 months in year 2014), to the reference month (December 
2013), / the previous year (2013), c (c=1,...,C) the generic homogeneous products 
group and m (m=1,...,M.) the series, unbiased sampling estimators of population 
parameters (elementary price indices aggregation) can be expressed as follows: 

The qo measure refers to the m-th quantity series in the previous year / (2013).
The weight w_cm is obtained as the inverse of the inclusion probability of the sampling
unit deriving from the sampling design.
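
For orientation, design-weighted versions of the three indices are commonly written as
below. The notation is an assumption and may differ from the authors': s_c is the
sample of series in product group c, p^t and p^0 are prices in the current and
reference month, q^0, q^t and q^b are quantities in the reference month, the current
month and the weight reference period, and w_cm is the design weight.

```latex
\hat{I}^{J}_{c}(t) = \exp\!\left(
  \frac{\sum_{m \in s_c} w_{cm}\,\ln\!\left(p^{t}_{cm}/p^{0}_{cm}\right)}
       {\sum_{m \in s_c} w_{cm}} \right),
\qquad
\hat{I}^{Lo}_{c}(t) =
  \frac{\sum_{m \in s_c} w_{cm}\,p^{t}_{cm}\,q^{b}_{cm}}
       {\sum_{m \in s_c} w_{cm}\,p^{0}_{cm}\,q^{b}_{cm}},

\hat{I}^{F}_{c}(t) = \sqrt{
  \frac{\sum_{m \in s_c} w_{cm}\,p^{t}_{cm}\,q^{0}_{cm}}
       {\sum_{m \in s_c} w_{cm}\,p^{0}_{cm}\,q^{0}_{cm}}
  \cdot
  \frac{\sum_{m \in s_c} w_{cm}\,p^{t}_{cm}\,q^{t}_{cm}}
       {\sum_{m \in s_c} w_{cm}\,p^{0}_{cm}\,q^{t}_{cm}} }
```

In these generic forms each weighted sum estimates the corresponding population total,
and the index is obtained by plugging the estimated totals into the population formula.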


3 Main results of the experimental phase 


The most meaningful results of the two experimental phases are shown in the 
following figures. Figure 1, from the first experimental phase, shows the level 
estimates of the monthly Jevons, Lowe and Fisher indices computed on probability 
(two stage sampling) and nonprobability samples and the true value (universe panel 
series SD, U) of the corresponding index for two consumption segments (coffee and 
pasta in Turin province). 

Figure 1 - Jevons, Lowe and Fisher indices computed with different selection schemes of
series for coffee and pasta segments, Turin province, year 2014


The comparison between probability and nonprobability selection schemes
shows common evidence for both products: the PPS sample estimates of the weighted
indices, Fisher and Lowe, almost overlap the “true” value U; the cut-off
estimates over-estimate it but follow the trend for coffee, while for pasta they almost
overlap the true value U. The most-sold-item estimates under-estimate the true value
and alter the trend for coffee with the weighted indices but not with Jevons, while
for pasta they show different trends for the three indices. The mean of the sample
estimates of the Jevons index strongly over-estimates the “true” value U for coffee
but not for pasta. These opposite performances for the two products can be explained
by the different numbers of items and turnover distributions. In general, also on the
basis of other evidence not shown for the sake of brevity, (i) probability sampling
always produces more accurate estimates than the nonprobability selection schemes;
(ii) the sampling scheme is not neutral with respect to the choice of aggregation
formula; (iii) the sampling error varies among consumption segments.

Figure 2, from the second experimental phase, illustrates the differences among
the three indices estimated under two different sampling designs: cluster sample of
outlets (with proportional allocation and PPS selection) versus two-stage sample
(proportional allocation and PPS selection of outlets, and Neyman allocation and
PPS selection of EANs).


Figure 2 - Jevons, Lowe and Fisher indices for the coffee segment estimated on one sample,
confidence interval (CI) of estimates at 95% and true value (computed on the universe of SD).
Turin, year 2014


The comparison between the two probability selection schemes highlights that all
the estimates seem to capture properly the level and the trend of the related true
index. The estimators of the Lowe and Fisher indices have, in both cases, wider
confidence intervals (CI) than the Jevons index, due to the variability of the
quantities involved in the weights. In general, the CIs are wider under the
two-stage design than under the (one-stage) cluster sampling design, even if the
difference does not seem large. In fact, the intra-group correlation is negative,
even if close to 0, so the variability within outlets seems slightly higher than
that between outlets. However, this aspect needs further study, also taking into
account all the consumption segments.
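
The role of the intra-group correlation can be seen from the usual textbook
approximation of the variance of an estimated mean under cluster-type designs. It is
quoted here as a simplified illustration (equal-sized outlets, no stratification,
finite population corrections ignored), not as the variance formula used in the
experiments; the symbols are assumptions: n is the total number of sampled EANs, m̄ the
average number of EANs sampled per outlet, S² the element variance and ρ the
intra-group correlation.

```latex
\operatorname{Var}\!\left(\hat{\bar{y}}\right) \approx
  \frac{S^{2}}{n}\,\bigl[\,1 + (\bar{m} - 1)\,\rho\,\bigr]
```

With n held fixed, a negative ρ makes the factor in brackets decrease as m̄ grows, so
observing all the EANs of a few outlets (cluster sampling) gives a slightly smaller
variance than subsampling EANs within more outlets (two-stage sampling). When ρ is close
to 0 the difference is small, in line with the intervals in Figure 2.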


4 Concluding remarks and future developments 


The two experimental phases produced interesting results regarding the performance 
of sampling schemes and index formulas in a closed population context and fixed 
approach. They lead to the conclusion that probability sampling is the better choice 
in this context. 

The next phase, currently in progress, concerns the comparison between a
fixed and a dynamic approach, the latter consisting in considering all series of an
open population (Ivancic et al., 2011). The elementary price indices are computed
considering both a closed and an open population. Assuming a closed population, direct
indices are built on a fixed basket of products defined at the reference time, ignoring
new products (fixed approach). In this context the indices are affected by the
shrinkage of the basket over time due to the attrition of products during the year.
In reality, however, many products disappear and new products enter continuously.
By using chain indices, the life cycle of products is taken into account, as the
basket of products changes month by month: the flexible basket consists of the
matched products sold in two consecutive months (dynamic approach).
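
The difference between the two approaches can be sketched with an unweighted Jevons
index computed on matched products. The code is illustrative only: prices and product
labels are hypothetical, and each month is a dictionary mapping a product to its price.

```python
import math


def jevons(prices_a, prices_b):
    """Jevons index between two periods over the products sold in both."""
    matched = prices_a.keys() & prices_b.keys()
    if not matched:
        raise ValueError("no matched products")
    logs = [math.log(prices_b[k] / prices_a[k]) for k in matched]
    return math.exp(sum(logs) / len(logs))


def direct_index(months):
    """Fixed approach: each month is compared with the base month."""
    base = months[0]
    return [jevons(base, month) for month in months]


def chain_index(months):
    """Dynamic approach: product of month-on-month matched-product indices."""
    index = [1.0]
    for previous, current in zip(months, months[1:]):
        index.append(index[-1] * jevons(previous, current))
    return index


# Product C enters in month 1, product B disappears in month 2
months = [
    {"A": 1.00, "B": 2.00},
    {"A": 1.10, "B": 2.00, "C": 3.00},
    {"A": 1.10, "C": 3.60},
]
print(direct_index(months))
print(chain_index(months))
```

In the last month the direct index relies on product A alone, because the base basket
has shrunk, while the chain index also uses the price change of the new product C.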

In order to evaluate the impact of the life cycle of products, direct and chain
price indices are compared. For this purpose, an artificial population has been
generated, with products appearing and disappearing (temporarily and
permanently). Starting from a panel of products, new products have been introduced
according to the monthly birth rates and old products have been removed
according to a survival function. Both monthly birth and survival rates have been
estimated on the real open population.

The construction of this artificial population is a device that makes it possible to
evaluate the whole error, both sampling and non-sampling, due to appearing and
disappearing products.

Interactive machine learning prediction for
budget allocation in digital marketing scenarios 


Machine learning per l’allocazione del budget attraverso 
la previsione interattiva di scenari digitali di marketing 


Della Valle Ermelinda, Scardovi Elena, Iacobucci Andrea, Tignone Edoardo 


Abstract Scenario Analysis estimates the relationship between budget allocation and
some typical digital analytics metrics quantifying the performance of marketing
campaigns. The actual values of the monitored Key Performance Indicators, derived
from different big data sources, are compared to their estimated values after a variation
in the investments. Our ensemble approach combines multivariate generalized linear
models with machine learning models. The R implementation is embedded into an
interactive dashboard that provides real-time score predictions. The visualization
makes the analysis easier for business users to consume and encourages new and deeper
experimentation with data.

Abstract L'analisi di scenario stima la relazione tra l’allocazione del budget e alcune 
metriche tipiche della digital analytics che quantificano le performance delle 
campagne di marketing. Alcuni indicatori chiave di performance (KPI), derivati da 
grandi quantità di dati provenienti da fonti differenti, sono confrontati con la stima 
del valore che assumerebbero se fossero modificati gli investimenti. L’ensemble 
predittivo da noi realizzato combina modelli lineari generalizzati e machine learning. 
L’implementazione in R è inserita in una dashboard interattiva per realizzare 
previsioni in tempo reale. La visualizzazione scelta rende semplice la fruizione 
dell’analisi da parte dei top manager e incoraggia una nuova e approfondita 
sperimentazione attraverso i dati. 

1 Motivating Example


The Scenario Analysis project described here was carried out for an international
company active in retail distribution that needed to understand the effectiveness
of its allocation of digital advertising investments, highlighting strategic
opportunities that would otherwise be ignored. We were required to build an interactive
dashboard, addressed to top managers, showing the relationship between budget
allocation and some typical digital analytics metrics coming from Google AdWords (GAW)
and Adobe Analytics (AA).

In the last decades, strategic market management has often been addressed from a
theoretical point of view [1]. Our aim is to drive business decisions through data and
modelling, allowing entrepreneurs without statistical training to exploit and interpret
evidence that is not usually examined together because it comes from different sources.
The main goal of Scenario Analysis is to show what would happen if something changed in
the business decision-making process. Responses can be influenced by a large number of
factors, many of which cannot be controlled by the experimenters or even quantified;
for this reason, the “Scenario” concept treats such factors as fixed.

In our application, we predicted Key Performance Indicators (KPIs) on the basis of 
differently allocated budget amounts. We addressed this Big Data problem by 
developing synthetic scores and working with machine learning models for quickly 
processing large amounts of data. 


2 Data 


We considered a period ranging from December 2014 to February 2017, with
weekly data updates. The large amount of digital data (~20GB), separately measuring
different marketing actions, had to be joined and harmonized in both format and
granularity. We addressed Clicks, i.e. the number of clicks on the advertisements
(GAW); Impressions, i.e. the number of times the advertisement was shown (GAW);
Impression Share, i.e. the percentage of impressions out of the potential (available)
impression amount (GAW); Page Views Per Visit, i.e. the average number of page
views for each visit; Entries, i.e. the number of visits directly deriving from
advertising (AA); Bounces, i.e. the number of entries with only one page view (AA);
Orders (AA); Conversion Rate, i.e. the ratio between Orders and Entries net of
Bounces.

As requested by our customer, we treated these quantities and the allocated budget
on a daily basis for each marketing campaign. The variable Cost was also available;
in fact, advertisers bid on certain keywords in the Google AdWords auction [2] in order
for their clickable ads to appear in search results, and pay Google a certain amount of
money (Cost) according to the clicks received by their ads, with a Cost per Click that
depends on the behaviour of internet users and competitors (Fig. 1). An optimal
allocation of the budget makes it possible to achieve the highest possible profits
with the smallest actual expense.

Campaigns were classified a priori into three distinct macro-areas of interest:
“Brand”, dedicated to strengthening the firm's name; “Cart”, focused on selling items;
“Promo”, concerning temporary offers. We were requested to consider budget
modifications in the Scenario at the macro-area level rather than for single campaigns.


3 Modelling and prediction 


The objective of the model was to predict the values of the metrics after budget
changes, under the assumption that all other factors were fixed.

The relationship between indicators changed according to the campaign macro-area
(Fig. 2). The different origin, scale and variability of the KPIs under study required
careful grouping on the basis of homogeneous behaviours. Clicks, Bounces and Entries
were similarly influenced by investments, with a nearly linear tendency; Impressions
and Potential Impressions depended on discrete budget jumps in a smooth nonlinear way;
Orders, Page Views and Visits required a model specification allowing different speeds
of growth, and Orders additionally required an ad-hoc approach able to address the
non-negligible presence of zero values for the least relevant products. Finally, Page
Views Per Visit and Conversion Rate are calculated fields that were addressed by tuning
the model on Page Views, Visits, Orders, Entries and Bounces.

Because of the strong influence of discontinuous budget investments and of the
instantaneous dynamics of Google AdWords, no significant temporal tendencies
encouraging a time-series approach were found; the temporal dimension is only
considered to combine contingent data.


3.1 Model specification


We adopted an ensemble approach [3] that combined a multivariate generalized linear
model with Gamma or Poisson specification with a dynamic implementation of
Random Forest [4]. The specific business requirement imposed the budget as the only
actionable lever. In order to ensure model flexibility, we included campaign-level
metadata and a campaign-group classification. The value of the n-th KPI reached by
the i-th campaign on day t was modelled as

\[ K_{nit} = f_n(a_i, C_{it}) \]

where $f_n$ is a stochastic function defined for the n-th KPI, n=1,...,N, and $C_{it}$ is the
amount of money (Cost) spent on the category $l_i$ of the i-th campaign, i=1,...,I:

\[ C_{it} = B_{it} \cdot CoB_{it} \cdot (1 + p_{l_i t}) \]

with $CoB_{it}$ being the ratio between Cost and Budget $B_{it}$, which in our Scenario is
assumed not to change after the investment percent modification $p_{l_i t}$. The number I of
campaigns was 77 in the initial six months of the project, and changed at each
data update. N is equal to 8, since this approach addresses Clicks, Impressions,
Potential Impressions, Entries, Bounces, Orders, Page Views and Visits. Notice that
some of these KPIs needed to be combined to calculate the quantities listed in Sec.
2 that were shown to the end user¹.

The inclusion of individual dynamics (with macro-area parameters $a_{l_i}$ instead of $a_i$)
turned out to be crucial. In the nonparametric machine learning ensemble member,
campaign information is included as an explanatory variable. A grid-based search
found 300 trees to be the best compromise between model efficiency and computational
effort for Random Forest. Regression and machine learning predictions were
combined in an ensemble favouring the model with the better fit on a
10% test set.
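
The exact combination rule is not detailed; one simple way of favouring the
better-fitting member is to weight each model inversely to its error on the held-out
set. The sketch below assumes this rule, with RMSE as the error measure; function names
and figures are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error on the held-out observations."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def ensemble_weights(actual, pred_glm, pred_rf):
    """Weights inversely proportional to test-set RMSE (assumed rule)."""
    inverse = [1.0 / max(rmse(actual, p), 1e-12) for p in (pred_glm, pred_rf)]
    total = sum(inverse)
    return [v / total for v in inverse]


def combine(weights, pred_glm, pred_rf):
    """Weighted average of the two members' predictions."""
    return [weights[0] * g + weights[1] * r for g, r in zip(pred_glm, pred_rf)]


# Hypothetical held-out values and predictions of the two ensemble members
actual = [100.0, 120.0, 90.0, 110.0]
pred_glm = [98.0, 125.0, 92.0, 108.0]
pred_rf = [105.0, 112.0, 97.0, 118.0]
weights = ensemble_weights(actual, pred_glm, pred_rf)
print(weights, combine(weights, pred_glm, pred_rf))
```

Here the regression member has the lower test error and therefore receives the larger
weight.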

Daily KPIs were obtained as $K_{nt} = \sum_{i=1}^{I} K_{nit}$ for n=1,...,N, with t denoting the
day and $K_{nit}$ being the ensemble result. When a multi-day period T is
desired, $K_n = \sum_{t \in T} K_{nt}$. Composite KPIs were calculated from these basic KPIs.
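
The aggregation and the composite KPIs can be sketched as follows. Names and figures
are hypothetical, and the formula for Conversion Rate, Orders / (Entries - Bounces), is
our reading of "Entries net of Bounces" in Sec. 2.

```python
def aggregate(predictions, kpi, days):
    """Sum a predicted KPI over campaigns and over the selected days.

    predictions maps (kpi, campaign, day) to the ensemble prediction.
    """
    return sum(v for (k, _, d), v in predictions.items() if k == kpi and d in days)


def composite_kpis(totals):
    """Composite KPIs computed from aggregated basic KPIs."""
    return {
        "Page Views Per Visit": totals["Page Views"] / totals["Visits"],
        "Conversion Rate": totals["Orders"] / (totals["Entries"] - totals["Bounces"]),
        "Impression Share": totals["Impressions"] / totals["Potential Impressions"],
    }


# Hypothetical predictions for two campaigns over two days
predictions = {
    ("Orders", "campaign_1", 1): 4.0,
    ("Orders", "campaign_2", 1): 6.0,
    ("Orders", "campaign_1", 2): 2.0,
    ("Orders", "campaign_2", 2): 3.0,
}
orders = aggregate(predictions, "Orders", days={1, 2})

totals = {
    "Page Views": 900.0, "Visits": 300.0, "Orders": orders,
    "Entries": 250.0, "Bounces": 100.0,
    "Impressions": 4000.0, "Potential Impressions": 10000.0,
}
print(composite_kpis(totals))
```

Computing the ratios after aggregation keeps the composite KPIs consistent with the
predicted basic KPIs shown for the same period.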


4 Dashboard visualization and Conclusions 


We built an interactive dashboard [5] consisting of a simple visual interface relying
on a computational engine that integrates an R instance computing real-time predictions
for scenario comparison.


Figure 3: Dashboard Layout. Left: budget modifications. Top-right: filters. Bottom-right: KPIs variations.


1 Direct modelling of composite KPIs (such as Conversion Rate) would have been possible, too; however, 
it would have required conservation of the ratio among its constituents as a constraint for preserving 
consistency of the scenario. 

A crucial feature of our approach was auto-tuning on the basis of filtering: campaign
and period selection provided the user with the best fit for the chosen selection. A
dynamic choice of the training period was provided in order to preserve stability and
prevent overfitting.

Model performance was evaluated through the improvement in the effectiveness
of the data-driven marketing strategies on Brand, Cart and Promo, in terms of the
increase in Visits, Orders and Conversion Rate. As an example, in the period Sep 2016 -
Feb 2017, compared with Mar 2016 - Aug 2016, these KPIs registered +89%, +51% and +10%
respectively, with a 16% increase in Budget and an effective 12% saving in Cost.

Nonparametric classification for directional data


Marco Di Marzio, Stefania Fensore, Agnese Panzera, Charles C. Taylor 


Abstract We discuss nonparametric methods to address the problem of classifica- 
tion for directional data. We focus on local regression and kernel density estimation 
when the domain is the unit circle. We provide asymptotic theory for the proposed 
methods along with simulation results. 

Abstract Discutiamo metodi non parametrici per problemi di classificazione di os- 
servazioni direzionali. In particolare consideriamo metodi di regressione locale e 
stima kernel di funzioni di densità nel caso in cui il dominio é il cerchio unitario, 
presentando alcune proprietà asintotiche e risultati simulativi. 


Key words: Density estimation, Discriminant analysis, Local weights, Logistic
regression, von Mises density


1 Introduction 


Circular data occur when the sample space is the unit circle. The peculiarity of a
circular measurement scale is that its beginning and its end coincide. After both an
origin and an orientation have been chosen, a circular observation can be measured,
in radians, by an angle $\theta \in [-\pi, \pi)$. Circular data often arise in biology
(migration paths, flight directions of animals), meteorology (wind and marine current
directions), and geology (orientations of joints and faults, landforms, oriented stones).

We propose nonparametric classification methods for circular data based both
on kernel estimation of the population densities and on local logistic regression. In
particular, we consider the case of two sub-populations. This research
field seems unexplored, although the need for flexible methods is evident
even in our very simple motivating example.

The paper is organized as follows. Section 2 collects some basic results on ker- 
nel estimation of circular densities. Section 3 deals with two different approaches 
to nonparametric regression with a circular predictor and a binary response, while 
Section 4 discusses kernel density estimation for discrimination. Finally, Section 5 
presents some simulation examples. 


2 Kernel circular density estimation 


Given a random sample of angles $\Theta_1, \ldots, \Theta_n$ from an unknown circular density $f$,
the kernel estimator of $f$ at $\theta \in [-\pi, \pi)$ can be defined as

$$\hat{f}(\theta; \kappa) := \frac{1}{n} \sum_{i=1}^{n} K_\kappa(\Theta_i - \theta),$$

where the weight $K_\kappa$ is a circular kernel with zero mean direction and concentration
parameter $\kappa > 0$; see Definition 1 in [2]. The weight function $K_\kappa$ is usually
chosen to be a continuous density function whose support is the circle, with the
property that, as $\kappa \to \infty$, the density tends to concentrate at the mode.
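As an illustration of the estimator above, the following is a minimal sketch that assumes a von Mises kernel (the kernel introduced in Section 3.1) and uses hypothetical sample angles. The helper names `bessel_i0` and `circular_kde` are ours, and the Bessel function is obtained by simple numerical integration rather than by a library call.

```python
import math

def bessel_i0(kappa, steps=2000):
    """Modified Bessel function I_0(kappa), via (1/pi) * integral_0^pi exp(kappa*cos t) dt."""
    h = math.pi / steps
    total = sum(math.exp(kappa * math.cos((k + 0.5) * h)) for k in range(steps))
    return total * h / math.pi

def circular_kde(theta, sample, kappa):
    """Kernel estimate (1/n) * sum_i K_kappa(Theta_i - theta) with a von Mises kernel."""
    norm = 2.0 * math.pi * bessel_i0(kappa)
    return sum(math.exp(kappa * math.cos(t - theta)) for t in sample) / (norm * len(sample))

# Hypothetical angles in radians, for illustration only.
sample = [-2.9, -0.4, 0.1, 0.3, 0.5, 1.2, 3.0]
kappa = 4.0

# The estimate is itself a circular density: periodic, and integrating to one.
grid = [-math.pi + 2.0 * math.pi * k / 720 for k in range(720)]
total_mass = sum(circular_kde(t, sample, kappa) for t in grid) * (2.0 * math.pi / 720)
print(round(total_mass, 6))
```

Because the kernel depends on the data only through $\cos(\Theta_i - \theta)$, observations near $-\pi$ and near $\pi$ are treated as neighbours, which is the point of working on the circle.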

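Now, for $j \in \mathbb{N}$ and a circular kernel $K_\kappa$, we set

$$\eta_j(K_\kappa) := \int_{-\pi}^{\pi} K_\kappa(\alpha) \sin^j(\alpha)\, d\alpha \quad \text{and} \quad \nu(K_\kappa) := \int_{-\pi}^{\pi} K_\kappa^2(\alpha)\, d\alpha,$$

where $K_\kappa$ is an $r$-th sin-order kernel if $\eta_0(K_\kappa) = 1$, $\eta_j(K_\kappa) = 0$ for $0 < j < r$,
and $\eta_r(K_\kappa) \neq 0$; see Definition 2 in [2].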

Assuming that: $f''$ is continuous at $\theta \in [-\pi, \pi)$; $K_\kappa$ is a second sin-order kernel;
as $n \to \infty$, $\eta_2(K_\kappa)$ and $\nu(K_\kappa)/n$ both go to 0; then it results

$$E[\hat{f}(\theta; \kappa)] = f(\theta) + \frac{\eta_2(K_\kappa)}{2} f''(\theta) + o(\eta_2(K_\kappa)), \tag{1}$$

and

$$Var[\hat{f}(\theta; \kappa)] = \frac{\nu(K_\kappa)}{n} f(\theta) + o\!\left(\frac{\nu(K_\kappa)}{n}\right). \tag{2}$$


3 Nonparametric circular logistic regression


One of the possible regression models for dealing with dichotomous data is logistic
regression. The goal is to find the best-fitting model to describe the relationship between a
binary outcome and a set of independent predictors. Logistic regression determines
the degree of membership of each individual to one of the two groups by fitting a
continuous function taking values in the interval $[0, 1]$.

Let $Y$ and $\Theta$ be a binary response and a circular predictor, respectively, and set
$\lambda(\theta) := P(Y = 1 \mid \Theta = \theta)$. Denote the density functions in the circular covariate
space for the successes ($Y = 1$) and for the failures ($Y = 0$) by $f_1$ and $f_2$, respectively,
and let $\pi_1$ be the proportion of successes in the population, and $\pi_2 = 1 - \pi_1$. Then,
for $\theta \in [-\pi, \pi)$,

$$\lambda(\theta) = \frac{\pi_1 f_1(\theta)}{\pi_1 f_1(\theta) + \pi_2 f_2(\theta)}. \tag{3}$$


3.1 Kernel estimator for binary regression 


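Given $n$ independent copies of $(\Theta, Y)$, $(\Theta_1, Y_1), \ldots, (\Theta_n, Y_n)$, assume that the sample
has been ordered in such a way that the first $n_1$ pairs are successes and the last
$n_2 = n - n_1$ ones are failures. Replacing $\pi_j$ in (3) with $n_j/n$, $j \in \{1, 2\}$, a kernel
estimator of $\lambda(\theta)$, $\theta \in [-\pi, \pi)$, can be defined as

$$\hat{\lambda}(\theta; \gamma, \omega, \mu) := \frac{(n_1/n)\, \hat{f}_1(\theta; \gamma)}{(n_1/n)\, \hat{f}_1(\theta; \omega) + (n_2/n)\, \hat{f}_2(\theta; \mu)}, \tag{4}$$

where $\hat{f}_j(\theta; \kappa)$, $j \in \{1, 2\}$ and $\kappa \in \{\gamma, \omega, \mu\}$, stands for the kernel estimator of $f_j(\theta)$
with a circular kernel $K_\kappa$. Estimators like the above one have been studied in the
Euclidean setting by [4]. When the concentration parameters in (4) are $\gamma = \omega = \mu$,
the resulting estimator is the Nadaraya-Watson estimator with circular predictor, see
[1]. When $\gamma = \omega$, assuming that: both $f_1$ and $f_2$ admit continuous derivatives up to
order two; both $K_\gamma$ and $K_\mu$ are second sin-order circular kernels, with $\eta_2(K_\kappa)$ and
$\nu(K_\kappa)/n$, $\kappa \in \{\gamma, \mu\}$, both going to 0 as $n \to \infty$; and that $\eta_2(K_\gamma) \sim \eta_2(K_\mu)$ and
$\nu(K_\gamma) \sim \nu(K_\mu)$; using results (1) and (2), we obtain

$$E[\hat{\lambda}(\theta; \gamma, \mu)] - \lambda(\theta) = \frac{\pi_1 \pi_2 \left( \eta_2(K_\gamma) f_2(\theta) f_1''(\theta) - \eta_2(K_\mu) f_1(\theta) f_2''(\theta) \right)}{2 \left( \pi_1 f_1(\theta) + \pi_2 f_2(\theta) \right)^2} + o(\eta_2(K_\gamma)),$$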
and 
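For the case of a von Mises kernel, i.e. $K_\kappa(\theta) := \{2\pi \mathcal{I}_0(\kappa)\}^{-1} \exp(\kappa \cos(\theta))$,
where $\mathcal{I}_u(\cdot)$ stands for the modified Bessel function of the first kind and order $u$, it
holds that, for $\kappa$ large enough,

$$\eta_2(K_\kappa) \sim \frac{1}{\kappa} \quad \text{and} \quad \nu(K_\kappa) \sim \frac{\kappa^{1/2}}{2\pi^{1/2}}. \tag{5}$$

As a consequence, using von Mises kernels for both $K_\gamma$ and $K_\mu$, with $\gamma \sim \mu$, we have

$$E[\hat{\lambda}(\theta; \gamma, \mu)] - \lambda(\theta) = O\!\left(\frac{1}{\gamma}\right) \quad \text{and} \quad Var[\hat{\lambda}(\theta; \gamma, \mu)] = O\!\left(\frac{\gamma^{1/2}}{n}\right).$$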
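The large-concentration approximations for the von Mises kernel can be checked numerically. The sketch below evaluates the second sin-moment and the roughness of the kernel by direct quadrature of their defining integrals and prints the ratios to $1/\kappa$ and $\kappa^{1/2}/(2\pi^{1/2})$; the function name `von_mises_moments` and the chosen values of $\kappa$ are illustrative assumptions.

```python
import math

def von_mises_moments(kappa, steps=20000):
    """Return (eta_2, nu) of the von Mises kernel by midpoint quadrature on [-pi, pi)."""
    h = 2.0 * math.pi / steps
    alphas = [-math.pi + (k + 0.5) * h for k in range(steps)]
    # exp(kappa*(cos(a) - 1)) avoids overflow; the factor exp(-kappa) cancels on normalisation.
    w = [math.exp(kappa * (math.cos(a) - 1.0)) for a in alphas]
    z = sum(w) * h
    dens = [wi / z for wi in w]
    eta2 = sum(d * math.sin(a) ** 2 for d, a in zip(dens, alphas)) * h
    nu = sum(d * d for d in dens) * h
    return eta2, nu

for kappa in (5.0, 20.0, 100.0):
    eta2, nu = von_mises_moments(kappa)
    ratio_eta = eta2 * kappa
    ratio_nu = nu / (math.sqrt(kappa) / (2.0 * math.sqrt(math.pi)))
    print(kappa, round(ratio_eta, 4), round(ratio_nu, 4))
```

Both ratios should approach one as $\kappa$ grows, which is what the asymptotic equivalences state.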


Notice that in the special case where $\gamma = \mu$, it is easily seen that the asymptotic bias
and variance of the resulting estimator are the same as those of the local constant estimator
with circular predictor; see [1].
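A minimal sketch of the kernel estimator of this subsection follows, assuming von Mises kernels and hypothetical angles for the two groups. It also checks the remark that equal concentration parameters give back the Nadaraya-Watson (local constant) estimator. All function names are ours.

```python
import math

def bessel_i0(kappa, steps=2000):
    """I_0(kappa) by numerical integration of exp(kappa*cos t) over [0, pi]."""
    h = math.pi / steps
    return sum(math.exp(kappa * math.cos((k + 0.5) * h)) for k in range(steps)) * h / math.pi

def kde(theta, sample, kappa):
    """von Mises kernel density estimate at theta."""
    norm = 2.0 * math.pi * bessel_i0(kappa)
    return sum(math.exp(kappa * math.cos(t - theta)) for t in sample) / (norm * len(sample))

def lambda_hat(theta, successes, failures, gamma, omega, mu):
    """Ratio estimator: class proportions n_j/n times kernel density estimates."""
    n1, n2 = len(successes), len(failures)
    n = n1 + n2
    num = (n1 / n) * kde(theta, successes, gamma)
    den = (n1 / n) * kde(theta, successes, omega) + (n2 / n) * kde(theta, failures, mu)
    return num / den

def nadaraya_watson(theta, angles, labels, kappa):
    """Local constant estimator of P(Y = 1 | Theta = theta)."""
    w = [math.exp(kappa * math.cos(t - theta)) for t in angles]
    return sum(wi * yi for wi, yi in zip(w, labels)) / sum(w)

# Hypothetical data: angles (radians) of successes and failures.
successes = [0.2, 0.5, 0.9, 1.1, 1.4]
failures = [-2.8, -2.0, -1.5, 2.6, 3.0]
theta = 0.7

same = lambda_hat(theta, successes, failures, 3.0, 3.0, 3.0)
nw = nadaraya_watson(theta, successes + failures, [1] * 5 + [0] * 5, 3.0)
mixed = lambda_hat(theta, successes, failures, 5.0, 5.0, 2.0)  # gamma = omega, different mu
print(same, nw, mixed)
```

With $\gamma = \omega$ the numerator is one of the two terms of the denominator, so the estimate stays in $[0, 1]$; with three unrelated concentrations this is no longer guaranteed.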

Concerning optimal smoothing, the standard approach in the Euclidean setting is
to consider a weighted version of the mean squared error. For practical implementation,
the smoothing parameters are selected by minimizing an empirical version of
the weighted mean squared error (for more details, see [4]).


3.2 Local polynomial binary regression 


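A different way to address nonparametric binary regression estimation is based
on the local likelihood approach. We start by defining the logit as a generic periodic
function $g$,

$$\log\left( \frac{\lambda(\Theta_i)}{1 - \lambda(\Theta_i)} \right) = g(\Theta_i, \beta),$$

which depends on the observations and a vector of parameters $\beta$. The inverse
transformation goes back from log-odds to probabilities, yielding the following circular
logistic regression function

$$\lambda(\Theta_i) = \frac{\exp(g(\Theta_i, \beta))}{1 + \exp(g(\Theta_i, \beta))}.$$

The associated log-likelihood function at $\theta \in [-\pi, \pi)$, localized using kernel weights, is

$$\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \left\{ Y_i \log\left( \frac{\lambda(\Theta_i)}{1 - \lambda(\Theta_i)} \right) + \log(1 - \lambda(\Theta_i)) \right\} K_\kappa(\Theta_i - \theta), \tag{6}$$

which can be reformulated in terms of $g$ as

$$\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \left\{ Y_i\, g(\Theta_i, \beta) - \log(1 + \exp(g(\Theta_i, \beta))) \right\} K_\kappa(\Theta_i - \theta).$$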
Modeling the log-odds as a sin-series expansion yields a nonparametric
method that we call Circular Local Polynomial Logistic Regression (CLP). In
particular, define


$$g(\Theta_i, \beta) = \sum_{j=0}^{p} \beta_j \sin^j(\Theta_i - \theta), \tag{7}$$


and, letting $\hat{\beta}_0$ be the solution for $\beta_0$ of the maximization of $\log \mathcal{L}(\theta)$ with respect
to $\beta = \{\beta_0, \ldots, \beta_p\}$, we get

$$\hat{\lambda}(\theta; \kappa) = \frac{\exp(\hat{\beta}_0)}{1 + \exp(\hat{\beta}_0)}.$$

Note that when $p = 0$, $\hat{\lambda}(\theta; \kappa)$ coincides with the estimator (4) with $\omega = \gamma = \mu$.
Moreover, re-writing the log-likelihood function using different weights for the
successes and failures as follows


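$$\sum_{i=1}^{n_1} \left\{ g(\Theta_i, \beta) - \log(1 + \exp(g(\Theta_i, \beta))) \right\} K_\gamma(\Theta_i - \theta) - \sum_{i=n_1+1}^{n} \log(1 + \exp(g(\Theta_i, \beta)))\, K_\mu(\Theta_i - \theta),$$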


and using $p = 0$ in approximation (7), we obtain the estimator (4) with $\omega = \gamma$.
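The local likelihood fit can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the local log-odds are taken as $\beta_0 + \beta_1 \sin(\Theta_i - \theta)$ (order 1) or $\beta_0$ alone (order 0), the kernel is von Mises, the maximisation uses plain Newton steps, and the data are simulated from a hypothetical success probability $0.5 + 0.4 \sin(\theta)$.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def clp_estimate(theta, angles, labels, kappa, order=1, iters=50):
    """Maximise the kernel-weighted log-likelihood by Newton steps; return lambda-hat(theta)."""
    # Kernel weights up to a constant factor, which does not change the maximiser.
    w = [math.exp(kappa * (math.cos(t - theta) - 1.0)) for t in angles]
    s = [math.sin(t - theta) for t in angles]
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for wi, si, yi in zip(w, s, labels):
            p = sigmoid(b0 + b1 * si)
            r = wi * (yi - p)          # weighted score contribution
            v = wi * p * (1.0 - p)     # weighted information contribution
            g0 += r
            g1 += r * si
            h00 += v
            h01 += v * si
            h11 += v * si * si
        if order == 0:
            b0 += g0 / h00
        else:
            det = h00 * h11 - h01 * h01
            b0, b1 = (b0 + (h11 * g0 - h01 * g1) / det,
                      b1 + (h00 * g1 - h01 * g0) / det)
    return sigmoid(b0)

# Simulated data (hypothetical model, fixed seed).
rng = random.Random(0)
angles = [-math.pi + 2.0 * math.pi * k / 200 for k in range(200)]
labels = [1 if rng.random() < 0.5 + 0.4 * math.sin(t) else 0 for t in angles]

print(clp_estimate(math.pi / 2, angles, labels, 3.0, order=1))
print(clp_estimate(-math.pi / 2, angles, labels, 3.0, order=1))
```

With `order=0` the maximiser has the closed form of a kernel-weighted proportion of successes, which is the local constant estimator mentioned in the text; the Newton iteration is kept only so that both orders share one code path.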


4 Circular KDE discrimination 


Kernel density estimation is commonly used for classification. Following the ap- 
proach proposed by [3] we consider two groups of observations and estimate the 
difference between the two densities at the observation point allocating the label 
according to the highest density. 

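In particular, we consider the problem of estimating the difference between
two circular densities, $h(\theta) = f_2(\theta) - f_1(\theta)$, using random samples of sizes $n_2$
and $n_1$ respectively. We are interested in the solution of $h(\theta) = 0$, given by $\theta_0$ such that
$f_1(\theta_0) = f_2(\theta_0) = f(\theta_0)$. Letting $\hat{h}(\theta; \kappa_1, \kappa_2) := \hat{f}_2(\theta; \kappa_2) - \hat{f}_1(\theta; \kappa_1)$, under suitable
assumptions on $\eta_2(K_{\kappa_j})$, $\nu(K_{\kappa_j})$ and $f_j''(\theta)$, $j \in \{1, 2\}$, by virtue of results (1)
and (2) we have that

$$E[\hat{h}(\theta; \kappa_1, \kappa_2)] = f_2(\theta) - f_1(\theta) + \frac{1}{2} \left\{ \eta_2(K_{\kappa_2}) f_2''(\theta) - \eta_2(K_{\kappa_1}) f_1''(\theta) \right\} + o\!\left( \eta_2(K_{\kappa_1}) + \eta_2(K_{\kappa_2}) \right),$$

and

$$Var[\hat{h}(\theta; \kappa_1, \kappa_2)] = \frac{\nu(K_{\kappa_1}) f_1(\theta)}{n_1} + \frac{\nu(K_{\kappa_2}) f_2(\theta)}{n_2} + o\!\left( \frac{\nu(K_{\kappa_1})}{n_1} + \frac{\nu(K_{\kappa_2})}{n_2} \right).$$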


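When $K_{\kappa_1}$ and $K_{\kappa_2}$ are both von Mises kernels, the asymptotic mean squared
error of $\hat{h}$ at a point $\theta_0$ such that $h(\theta_0) = 0$ is

$$AMSE[\hat{h}(\theta_0; \kappa_1, \kappa_2)] = \frac{1}{4} \left\{ \frac{f_2''(\theta_0)}{\kappa_2} - \frac{f_1''(\theta_0)}{\kappa_1} \right\}^2 + \frac{f(\theta_0)}{2\pi^{1/2}} \left[ \frac{\kappa_1^{1/2}}{n_1} + \frac{\kappa_2^{1/2}}{n_2} \right],$$

which is minimised by

$$\kappa_1 = \left\{ \frac{f(\theta_0)}{2\sqrt{\pi}\, n_1 f_1''(\theta_0)^2 - (2\sqrt{\pi}\, n_1 f_1''(\theta_0))^{5/3} (2\sqrt{\pi}\, n_2)^{-2/3} f_2''(\theta_0)^{1/3}} \right\}^{-2/5},$$

and

$$\kappa_2 = \left\{ \frac{f(\theta_0)}{2\sqrt{\pi}\, n_2 f_2''(\theta_0)^2 - (2\sqrt{\pi}\, n_2 f_2''(\theta_0))^{5/3} (2\sqrt{\pi}\, n_1)^{-2/3} f_1''(\theta_0)^{1/3}} \right\}^{-2/5}.$$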
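The minimising concentrations can be evaluated and checked numerically. The sketch below uses hypothetical values for $f(\theta_0)$, $f_1''(\theta_0)$ and $f_2''(\theta_0)$; the two second derivatives are given opposite signs, which appears to be the case in which the first-order conditions of the AMSE admit a solution. The fractional powers are taken as real odd roots, an assumption made explicit in `signed_pow`.

```python
import math

def signed_pow(x, p):
    """Real-valued x**p for odd-root exponents such as 1/3 and 5/3, keeping the sign of x."""
    return math.copysign(abs(x) ** p, x)

def amse(k1, k2, f0, d1, d2, n1, n2):
    """Asymptotic MSE of the density-difference estimator with von Mises kernels."""
    bias = 0.5 * (d2 / k2 - d1 / k1)
    var = f0 / (2.0 * math.sqrt(math.pi)) * (math.sqrt(k1) / n1 + math.sqrt(k2) / n2)
    return bias ** 2 + var

def optimal_kappa(f0, d_own, d_other, n_own, n_other):
    """Closed-form minimiser for the concentration attached to the 'own' sample."""
    c = 2.0 * math.sqrt(math.pi)
    bracket = (c * n_own * d_own ** 2
               - signed_pow(c * n_own * d_own, 5.0 / 3.0)
               * (c * n_other) ** (-2.0 / 3.0)
               * signed_pow(d_other, 1.0 / 3.0))
    return (f0 / bracket) ** (-2.0 / 5.0)

# Hypothetical inputs: common density value and second derivatives at theta_0.
f0, d1, d2, n1, n2 = 0.2, 0.3, -0.2, 100, 100
k1 = optimal_kappa(f0, d1, d2, n1, n2)
k2 = optimal_kappa(f0, d2, d1, n2, n1)
best = amse(k1, k2, f0, d1, d2, n1, n2)
print(k1, k2, best)
```

A quick sanity check is to perturb either concentration by a few percent and verify that the AMSE does not decrease.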


5 Numerical examples 


In Figure 1 we observe examples of the behaviour of parametric logistic regression.
It is obtained using $g(\Theta_i, \beta) := \beta_0 + \beta_1 \cos(\Theta_i) + \beta_2 \sin(\Theta_i)$ as link function and
a uniform circular density as a weight. We can see that it works perfectly for simple
patterns, i.e. when the one-labelled group and the zero-labelled one are well separated.
It works somewhat worse when the data are noisy. Finally, the logistic regression
performs very poorly when the patterns are more complex. A way to avoid this problem is to
localize the logistic regression by introducing a non-uniform spatial weight. Here the
weight function used is the von Mises kernel with a concentration parameter equal
to 3. An example of the behaviour of the estimator obtained using expression (7)
with $p = 1$ is depicted in Figure 2.

On the KDE discrimination side, we consider two samples of $n_1 = n_2 = 100$
observations (black and red rugs) labelled by 1 and 0 respectively. They have different
means and equal variances: the first sample is drawn from a vM(1.5, 3) population,
the second one is drawn from a vM(4.5, 3) population. We obtain two kernel density
estimates, selecting the concentration parameters by least squares cross-validation
(LSCV). We assign to each observation the label of the population exhibiting the
highest density at it. In this example we obtain a misclassification rate equal to
0.5%. See Figure 3.
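The discrimination experiment can be imitated with the standard library alone. The sketch below is not a reproduction of the reported result: the seed is arbitrary and, as a simplifying assumption, both densities use the same fixed concentration (`kappa = 10.0`) instead of LSCV-selected ones, so the resulting rate will generally differ from the figure quoted above.

```python
import math
import random

def kernel_mean(theta, sample, kappa):
    """Average von Mises kernel weight at theta. With a common kappa the normalising
    constant 1 / (2*pi*I_0(kappa)) is shared by both groups and can be dropped."""
    return sum(math.exp(kappa * (math.cos(t - theta) - 1.0)) for t in sample) / len(sample)

rng = random.Random(2017)                       # arbitrary seed
sample1 = [rng.vonmisesvariate(1.5, 3.0) for _ in range(100)]   # label 1
sample0 = [rng.vonmisesvariate(4.5, 3.0) for _ in range(100)]   # label 0
kappa = 10.0                                    # fixed smoothing, an assumption

def classify(theta):
    """Allocate the label of the group with the highest estimated density at theta."""
    return 1 if kernel_mean(theta, sample1, kappa) >= kernel_mean(theta, sample0, kappa) else 0

errors = sum(classify(t) != 1 for t in sample1) + sum(classify(t) != 0 for t in sample0)
rate = errors / (len(sample1) + len(sample0))
print(rate)
```

Replacing the fixed `kappa` by two cross-validated concentrations would require restoring the normalising constants, since they no longer cancel.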
Fig. 1 Examples of parametric circular logistic regression

Fig. 3 Example of circular discrimination using KDE between samples drawn from vM
populations with different means and equal variances. The two bandwidths have been
automatically selected using LSCV.


Introduction to Symbolic Data Analysis and 
application to post clustering for comparing and 
improving clustering methods by the Symbolic 
Data Table that they induce 


Edwin Diday 


Abstract First we recall that Symbolic Data Analysis (SDA) is a way of thinking by
classes in Data Science. We recall that in SDA, classes of standard units are the new
statistical units, at a higher level than the initial standard statistical units. In SDA,
classes are considered as objects to be described in all their facets by “symbolic
data”, taking into account their internal variability while staying close to the user's language.
Then we focus on different strategies for building a symbolic data table from a
standard data table by using: clustering (k-means, dynamic clustering), fuzzy
clustering (by EM, others), mixture decomposition of copulas (by a “copula-EM” or
a “copula-dynamic clustering”). A few words will also be said on how to build classes
at the second level (where the units are classes) by using Dirichlet models. Then we
give tools to measure the quality of the obtained symbolic data tables. In
this way we can compare the different associated clustering methods and improve
them. Finally, we show how to summarize the obtained symbolic data tables by
symbolic data, and also how to visualize and compare them and their associated
clustering methods, for example by an extension of PCA to symbolic data tables
directly (Diday 2015) or by using a Wasserstein distance (Irpino, Verde 2015,
2016).


Key words: Symbolic Data Analysis, classes, objects, clustering, symbolic data 
table, Principal component analysis of symbolic data. 
1 Introduction


Standard data tables are defined by a set of statistical units described by standard
(numerical or categorical) variables. Complex data are data that cannot be reduced to a
single standard data table; on the contrary, they are defined by several data tables with
different statistical units and different variables. In unsupervised learning, classes of
statistical units are obtained by a clustering process, allowing a concise and
structured view of the data; in supervised learning, classes are used in order to
extract efficient rules from the data.

A third way of thinking by classes consists in describing the classes (as “objects”) in
their different facets by “symbolic data”, in order to take into account their internal
variability by means of intervals, bar charts, histograms, quantile functions, etc., allowing the
fusion of complex and big data in an explanatory framework. These symbolic
descriptions lead to an extension of standard data tables (of numerical or categorical
variables) to symbolic data tables (of symbolic-valued variables) and therefore to an
extension of standard Data Analysis to Symbolic Data Analysis. One of the
advantages of this approach is that unstructured data and unpaired samples at the level
of row units become structured and paired at the level of classes. Another advantage
is that the symbolic class description has a high explanatory power, as it is expressed
in terms of the initial variables and thus in the user's language. The study of this
new type of data, built in order to describe classes in an explanatory way, has led to a
new domain called “Symbolic Data Analysis” (SDA). Several international journals,
such as ADAC (first journal in Classification) and Man And Cybernetics, have recently
(2015, 2016) published special issues on this topic. We can also mention several
books (Bock, Diday (2000), Billard, Diday (2006), Diday, Noirhomme (2008)) and
review papers (Billard, Diday (2003), Brito (2014), Diday (2016)).

From the standard data table of the players, described by their weight, nationality and
number of goals, it is easy to build Table 1, where each cell expresses the variability
inside the class of players of each team.

In the ground data table underlying Table 1, the variability (of weight, nationality, number
of goals) concerns the players of each team, or the team itself for the number of goals in a
match. Therefore, by a fusion process (which transforms the ground data table into a
“symbolic data table”) each cell can contain: an interval (min-max, interquartile
or other intervals of the weights of the players of each team), a sequence of categorical
values (the list of nationalities of the players of each team), a sequence of weighted values as a
bar chart (expressing the frequency of the number of goals in the matches of a season),
a histogram, a quantile function or simply a number (such as the mean age of the team
players, or the correlation between two variables, e.g. between age
and weight inside each team), etc. This new kind of variable is called
“symbolic” as its values cannot be treated as numbers.
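The fusion process described above can be sketched in a few lines. The player rows and match results below are hypothetical and only mimic the structure of Table 1; the function name `build_symbolic_table` and the dictionary layout are illustrative choices, not part of any SDA software.

```python
from collections import Counter, defaultdict

# Hypothetical ground data: one row per player (team, weight in kg, nationality).
players = [
    ("BARSA", 75, "Spain"), ("BARSA", 89, "Arg"), ("BARSA", 80, "Spain"),
    ("PARIS-ST G.", 76, "Fr"), ("PARIS-ST G.", 95, "Tunisia"),
]
# Hypothetical goals scored by each team in the matches of a season.
goals = {"BARSA": [0, 0, 0, 0, 1], "PARIS-ST G.": [0, 0, 1, 2, 2]}

def build_symbolic_table(players, goals):
    """Aggregate player rows into one symbolic description per team."""
    by_team = defaultdict(list)
    for team, weight, nationality in players:
        by_team[team].append((weight, nationality))
    table = {}
    for team, rows in by_team.items():
        weights = [w for w, _ in rows]
        counts = Counter(goals[team])
        n_matches = len(goals[team])
        table[team] = {
            "weight": (min(weights), max(weights)),                  # interval value
            "nationality": sorted({nat for _, nat in rows}),         # categorical set
            "goals": {g: c / n_matches for g, c in sorted(counts.items())},  # bar chart
        }
    return table

sdt = build_symbolic_table(players, goals)
for team, description in sdt.items():
    print(team, description)
```

Each team becomes one row whose cells are an interval, a set of categories and a frequency distribution, so the internal variability of the class is kept instead of being averaged away.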

Any data set which cannot be considered as a single data table of standard statistical
units described by standard (numerical or categorical) variables is considered a
“complex data set”. This situation happens often and in many domains. For example,
in National Statistical Institutes (NSI), the units to be studied are often “regions”
characterized by different data tables of different units (hospitals, schools,
inhabitants) and of different variables. A single symbolic data table where the units
are the regions described by symbolic data can be obtained by aggregation of the
variables describing hospitals, schools and inhabitants.


Table 1: Teams are described by symbolic-valued variables that account for the variability of the
players inside each team (weight and nationality) and for the variability of the team itself (number
of goals in the matches of a season).


TEAM          WEIGHT     NATIONALITY         NB OF GOALS
BARSA         [75, 89]   {Spain, Arg, ...}   {0.8 (0), 0.2 (1)}
MANCHESTER    [80, 95]   {Fr, Alg, Arg}      {0.1 (0), 0.3 (1), ...}
PARIS-ST G.   [76, 95]   {Fr, Tunisia}       {0.4 (0), 0.2 (1), ...}
DORTMUND      [70, 85]   {Fr, Engl, Arg}     {0.2 (0), 0.5 (1), ...}


2 Building symbolic data tables from clustering methods 


In partitioning, the dynamic clustering method (DCM), like k-means, is based on a
“representation function” $g$, which maps any partition $P = (P_1, \ldots, P_k)$ to $g(P) = L$ with
$L = (L_1, \ldots, L_k)$, and an “allocation function” $f$, which associates with a $k$-tuple of
representations $L = (L_1, \ldots, L_k)$ a partition $P = (P_1, \ldots, P_k)$. The representation can associate
with any class a distribution (Diday, Schroeder (1975)), a regression, a factorial axis, or points of
the population (see Diday (1973)). For an overview see Diday, Simon (1979). In the special case
where the representation function associates a mean with each class, we get the k-means method
as a particular case of DCM.

By applying these two functions alternately, the method converges towards a partition
(i.e. a set of classes which covers the population) and a representation associated with
each class. The quality criterion can be written in the following way:

$$W(P, L) = \sum_{i=1}^{k} w(P_i, L_i),$$

where $w$ is a positive measure of the “fit” between each class
and its representation, which decreases when the fit improves. (For example, if $L_i$ is a
distribution, then $w(P_i, L_i)$ can be the inverse of the likelihood of the class $P_i$ for the
distribution $L_i$.)

In the following we set $P^{(n)} = (P^{(n)}_1, \ldots, P^{(n)}_k)$ and $L^{(n)} = (L^{(n)}_1, \ldots, L^{(n)}_k)$.

Starting from a partition $P^{(0)}$, the value of the sequence $u_n = W(P^{(n)}, L^{(n)})$ decreases
at each step $n$ of the method. This can be proved in the following way.

First, by the reallocation of each individual $x$ belonging to a class $P^{(n)}_i$ to a new class,
$f(L^{(n)}) = P^{(n+1)}$, such that the new classes $P^{(n+1)}_i$ obtained by this reallocation
improve the fit with $L^{(n)}_i$, we get $w(P^{(n+1)}_i, L^{(n)}_i) \le w(P^{(n)}_i, L^{(n)}_i)$ for $i = 1, \ldots, k$,
which implies

$$\sum_{i=1}^{k} w(P^{(n+1)}_i, L^{(n)}_i) \le \sum_{i=1}^{k} w(P^{(n)}_i, L^{(n)}_i)$$

and therefore

$$W(P^{(n+1)}, L^{(n)}) \le W(P^{(n)}, L^{(n)}) = u_n.$$

Second, by the representation process of each class $P^{(n+1)}_i$. In that way, we can define
a new representation $L^{(n+1)} = (L^{(n+1)}_1, \ldots, L^{(n+1)}_k)$, where $L^{(n+1)}_i = g(P^{(n+1)}_i)$ fits
$P^{(n+1)}_i$ better than $L^{(n)}_i$. This means that $w(P^{(n+1)}_i, L^{(n+1)}_i) \le w(P^{(n+1)}_i, L^{(n)}_i)$ for $i = 1, \ldots, k$
and therefore

$$u_{n+1} = W(P^{(n+1)}, L^{(n+1)}) \le W(P^{(n+1)}, L^{(n)}).$$

Hence, at this step, we get

$$u_{n+1} = W(P^{(n+1)}, L^{(n+1)}) \le W(P^{(n+1)}, L^{(n)}) \le W(P^{(n)}, L^{(n)}) = u_n.$$

Therefore, as the positive sequence $u_n$ decreases at each step, it converges.
Notice that in the reallocation process the allocation of an individual to a new class
can be done optimally, which means that it is allocated to the class which decreases
the criterion the most. Notice also that a simple condition ensuring that the sequence $u_n$
decreases is that $w(C, g(C)) \le w(C, L)$ for any class $C$ and any representation $L$
of this class.
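The alternation of allocation and representation, and the resulting decrease of the criterion, can be illustrated with the k-means special case. In this sketch the fit $w$ is assumed to be the within-class sum of squared deviations, the data are hypothetical one-dimensional points, and the function names are ours.

```python
def allocate(points, reps):
    """Allocation function f: assign each point to the class of its nearest representative."""
    classes = [[] for _ in reps]
    for x in points:
        j = min(range(len(reps)), key=lambda i: (x - reps[i]) ** 2)
        classes[j].append(x)
    return classes

def represent(classes, old_reps):
    """Representation function g: the mean of each class (an empty class keeps its old one)."""
    return [sum(c) / len(c) if c else r for c, r in zip(classes, old_reps)]

def criterion(classes, reps):
    """W(P, L): sum over classes of the squared deviations from the representative."""
    return sum((x - r) ** 2 for c, r in zip(classes, reps) for x in c)

points = [0.1, 0.4, 0.5, 1.9, 2.2, 2.4, 5.0, 5.3]   # hypothetical data
reps = [0.0, 0.3, 6.0]                              # deliberately poor starting point
history = []
for _ in range(10):
    classes = allocate(points, reps)
    history.append(criterion(classes, reps))   # value after the reallocation step
    reps = represent(classes, reps)
    history.append(criterion(classes, reps))   # value after the representation step

print([round(v, 4) for v in history])
```

The recorded values never increase, in line with the two-step argument above: reallocation cannot worsen the total fit for fixed representatives, and the mean minimises the squared deviations for a fixed class.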
In partitioning by mixture decomposition, we can apply the DCM in the case where
$g(C)$ is the law which best fits the class $C$ and follows a given model (Gaussian,
Poisson, Gamma, etc.; see Diday, Schroeder (1975)). Its parameters are obtained by
likelihood estimation. It is then easy to show that the sequence $u_n$ converges, yielding a
partition and a vector of $k$ laws such that each law fits its associated class as well as possible.

The EM method (Dempster et al. (1977)) produces a fuzzy partitioning where each
fuzzy class $C_j$ (defined by a membership weight $t_j(x_i)$ associated with each
individual $x_i$) is associated with a maximum-likelihood law of the mixture
decomposition of the probability density of the population, decomposed into $k$
weighted density functions expressed in the following way:

$$f(x) = \sum_{j=1}^{k} p_j f(x, a_j),$$

with $\sum_{j=1}^{k} p_j = 1$ and $0 < p_j < 1$, where $f(x, a_j)$ is the probability density of
parameter $a_j$ and $p_j$ is the likelihood estimator of the proportion of individuals
following the density $f(x, a_j)$ in the mixture.

The method aims to maximize, alternately, the following likelihood equation:

$$L(x_1, \ldots, x_N, a_1, \ldots, a_k, p_1, \ldots, p_k) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} p_j f(x_i, a_j). \tag{1}$$

The method can be decomposed in two steps: “Estimation” and “Maximization”. In
the “estimation” step, for $j = 1, \ldots, k$ and $i = 1, \ldots, N$, the $t_j^n(x_i)$, which are the a posteriori
probabilities that the individual $x_i$ belongs to the class $j$ at step $n$, are determined by:

$$t_j^n(x_i) = \frac{p_j^n f(x_i, a_j^n)}{\sum_{j'=1}^{k} p_{j'}^n f(x_i, a_{j'}^n)}. \tag{2}$$

In other words, $t_j^n(x_i)$ is the fuzzy membership weight of the individual $x_i$ to the
fuzzy class $C_j$ induced by the law $f(\cdot, a_j^n)$.

In the “maximization” step, first the weight associated with each law $f(\cdot, a_j)$ is
determined by:

$$p_j^{n+1} = \frac{1}{N} \sum_{i=1}^{N} t_j^n(x_i). \tag{3}$$
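Then, the parameters $a_j^{n+1}$ are obtained by the likelihood maximization of
equation (1), where $l$ is the number of coordinates of the parameter $a_j^{n+1}$:

$$\forall j = 1, \ldots, k, \; m = 1, \ldots, l: \quad \sum_{i=1}^{N} t_j^n(x_i)\, \frac{\partial \log f(x_i, a_j)}{\partial a_{jm}} = 0. \tag{4}$$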
Notice that, contrary to the DCM, the clusters induced by the EM method by
attributing to each individual its best density function (i.e. the one with the highest density
value) do not follow this density function.
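The two steps of the EM method can be sketched for a one-dimensional mixture. The example assumes Gaussian components, for which the maximization step has a closed form (weighted means and variances); the simulated data, the starting values and the function names are illustrative assumptions.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, p, means, variances):
    """Sum over individuals of the log of the mixture density."""
    return sum(math.log(sum(pj * normal_pdf(x, m, v) for pj, m, v in zip(p, means, variances)))
               for x in xs)

def em_step(xs, p, means, variances):
    k, n = len(p), len(xs)
    # Estimation step: posterior membership weights t[i][j].
    t = []
    for x in xs:
        num = [p[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        s = sum(num)
        t.append([v / s for v in num])
    # Maximization step: new proportions, then Gaussian parameters in closed form.
    new_p, new_means, new_vars = [], [], []
    for j in range(k):
        tj = sum(t[i][j] for i in range(n))
        mean = sum(t[i][j] * xs[i] for i in range(n)) / tj
        var = sum(t[i][j] * (xs[i] - mean) ** 2 for i in range(n)) / tj
        new_p.append(tj / n)
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))   # guard against a degenerate component
    return new_p, new_means, new_vars, t

# Simulated data from two hypothetical Gaussian classes (fixed seed).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(5.0, 1.0) for _ in range(100)]

p, means, variances = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
history = [log_likelihood(xs, p, means, variances)]
for _ in range(30):
    p, means, variances, t = em_step(xs, p, means, variances)
    history.append(log_likelihood(xs, p, means, variances))

print(p, means, variances)
```

The membership weights `t` returned by the last step are the fuzzy class memberships; taking, for each individual, the class with the largest weight gives the exact partition discussed in the text.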
How to build the symbolic data table from a mixture decomposition?
By the analytical approach:
It is easy to build the marginals of the joint distributions associated with each class by DCM
or EM and then to obtain several kinds of symbolic variables describing each class. These
symbolic variables can be built so that their value is a density, a quantile or
a distribution function, a histogram or a bar chart (depending on the kind of initial
variable: numerical or categorical), a min-max or an inter-quartile interval, a
list of percentiles, etc. The graphical representation of the obtained density functions
can be obtained by a kernel method, using a bandwidth which allows a clear
graphical representation inside the cells of the SDT (see Silverman (1986), Jones et al.
(2012)). Histograms are another way to represent the obtained marginals of the joint
distributions with good explanatory power. In both cases, the bandwidth for the density function
and the intervals for the histograms have to satisfy two conditions. The first condition
is that, for a given variable, they must be the same for all the classes (so that
they become comparable); the second condition is that the induced density function
or histogram graphical representation must discriminate the classes as well as possible. An
efficient way to satisfy these two conditions in the case of histograms is
given in Diday et al. (2013).
By using the membership weights of the individuals in the classes:

Another way to get the symbolic data table and the graphical representation of the
marginals associated with each obtained joint distribution is to use the weights associated
with each individual in each class. Then, by addition, we can build the same kinds of symbolic
variables. In the case of a class $C_j$ and an initial numerical variable $Y_r$, we obtain a
histogram denoted $H_{jr}$, such that:

$$H_{jr} = \left( \frac{\sum_{i=1}^{N} t_j(x_i)\, \delta_1(x_i^r)}{\sum_{v=1}^{V} \sum_{i=1}^{N} t_j(x_i)\, \delta_v(x_i^r)}, \; \ldots, \; \frac{\sum_{i=1}^{N} t_j(x_i)\, \delta_V(x_i^r)}{\sum_{v=1}^{V} \sum_{i=1}^{N} t_j(x_i)\, \delta_v(x_i^r)} \right), \tag{5}$$

where $x_i^r = Y_r(x_i)$ and $\delta(x_i^r)$ is a Dirac-based vector, defined by a partition $(I_1, \ldots, I_V)$
of the domain $D_r$ of the numerical variable $Y_r$ into $V$ intervals, such
that $\delta(x_i^r) = (\delta_1(x_i^r), \ldots, \delta_V(x_i^r))$, where $\delta_v(x_i^r)$ takes the value 1 if $x_i^r \in I_v$ and
0 elsewhere.

In the case of a categorical variable with categories denoted $I_1, \ldots, I_V$, we can
obtain, in an analogous way, a symbolic variable with bar chart values. In this case,
$\delta_v(x_i^r)$ takes the value 1 if $x_i^r$ takes the category $I_v$ and 0 elsewhere.
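A membership-weighted histogram of this kind can be computed as follows. The values, membership weights and interval bounds are hypothetical, and `weighted_histogram` is an illustrative helper rather than an existing library function.

```python
def weighted_histogram(values, weights, edges):
    """Membership-weighted relative frequencies over the intervals
    [edges[v], edges[v+1]); the last interval also includes its right end point."""
    mass = [0.0] * (len(edges) - 1)
    for x, t in zip(values, weights):
        for v in range(len(mass)):
            last = v == len(mass) - 1
            if edges[v] <= x < edges[v + 1] or (last and x == edges[-1]):
                mass[v] += t     # the indicator of interval v selects where the weight goes
                break
    total = sum(mass)
    return [m / total for m in mass]

# Hypothetical weights (kg) of five individuals and their memberships to one fuzzy class.
values = [70, 76, 81, 88, 93]
memberships = [0.9, 0.8, 0.5, 0.2, 0.1]
edges = [70, 80, 90, 100]

fuzzy_hist = weighted_histogram(values, memberships, edges)
crisp_hist = weighted_histogram(values, [1, 1, 1, 0, 0], edges)   # exact partition
print(fuzzy_hist, crisp_hist)
```

With 0/1 memberships the result reduces to the ordinary relative-frequency histogram of the individuals in the class, which is the exact-partition case mentioned in the text.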

Notice that the fuzzy partition produced by EM induces an exact partition $(C'_1, \ldots, C'_k)$
such that

$C'_j = \{ x_i : f(x_i, a_j) \ge f(x_i, a_{j'}) \text{ for all } j' = 1, \ldots, k \}$. In the case of an exact partition
obtained from EM or from DCM, we have $t_j(x_i) = 1$ if $x_i$ belongs to class $j$ and 0 otherwise,
for $j = 1, \ldots, k$ and $i = 1, \ldots, N$, in formula (5), from which we can build
histogram-valued (resp. bar-chart-valued) variables from numerical (resp. categorical)
variables. In this case the result is biased for EM (resp.
not biased for DCM) as it does not correspond (resp. corresponds) to the densities obtained
by EM (resp. DCM). On the contrary, we can use formula (5) as it stands for EM and,
for DCM, replace $t_j(x_i)$ by $w(\{x_i\}, L_j)$ in formula (5) (where $w(\{x_i\}, L_j)$
measures the fit between the individual $x_i$ and the representation $L_j$ of the class $C_j$;
see section 4.1.1). In this case the result is biased for DCM (resp. not biased for
EM) as it does not correspond (resp. corresponds) to the densities obtained by DCM (resp. EM).


Copulas have already been used in SDA at the level of a symbolic data table as input
(see Vrac et al. (2011)). Here, our aim is to use copulas on the initial standard data
table in order to produce an SDT. The advantage of this copula approach is that we
directly obtain the marginal law associated with each class, which then directly induces
the symbolic data table (and not in two steps as in the preceding clustering methods).
By partitioning, we can apply the DCM in the case where $g(C)$ is the copula which
best fits the class $C$ and follows a given copula model (Gaussian, Frank, etc.).
By fuzzy partitioning, the model is the following:

$$f(x_i) = \sum_{j=1}^{k} p_j\, Cop_{a_{C_j}}\!\left( f(x_{i1}, a_{1j}), \ldots, f(x_{iq}, a_{qj}) \right),$$

where $q$ is the number of initial variables and
$a = ((a_{C_1}, a_{11}, \ldots, a_{q1}), \ldots, (a_{C_k}, a_{1k}, \ldots, a_{qk}))$ defines the
parameters of the mixture, $a_{C_j}$ being the parameter of the copula model associated
with the fuzzy class $j$.

The EM algorithm can then be extended to a Copula-EM algorithm by using, in
formulas (1) to (4), $Cop_{a_{C_j}}(f(x_{i1}, a_{1j}), \ldots, f(x_{iq}, a_{qj}))$ instead of $f(x_i, a_j)$ for
$i = 1, \ldots, N$ and $j = 1, \ldots, k$.

Having obtained the marginals $f(x_{is}, a_{sj})$ for $i = 1, \ldots, N$, $s = 1, \ldots, q$ and $j = 1, \ldots, k$, we can
build (as in the preceding clustering methods) a symbolic data table with many kinds
of symbolic variables (density functions, histograms, min-max or inter-quartile
intervals, lists of percentiles, etc.).


In the case where the initial data are defined by a symbolic data table, we start from an
SDT where each row describes a class of individuals of the ground data table. By
using random variables with symbolic values following, for example, a Dirichlet
model, all the clustering methods presented in this section can be extended to this
more general situation by transforming the description of each individual of the
ground data table into a Dirac mass vector as in the preceding section. Notice that, as
we start from an SDT, the obtained marginals are laws of laws; therefore, in order to get
an explanatory SDT in the case of a Dirichlet-EM clustering method which leads to
$K'$ clusters of classes, it is better to use the sum $\sum_{j=1}^{k} t_{j'}(C_j)\, h_j^r$, where $h_j^r$ is the
symbolic value (a histogram, for example) of the $r$-th symbolic variable for the class
$C_j$ and $t_{j'}(C_j)$ is the fuzzy membership weight of the class $C_j$ to the fuzzy cluster of
classes denoted $C_{j'}$, for $j' = 1, \ldots, K'$. Hence, in the obtained SDT, $\sum_{j=1}^{k} t_{j'}(C_j)\, h_j^r$ is the
value of the $r$-th symbolic variable for the row associated with the cluster $C_{j'}$ of classes.
Another kind of fuzzy clustering in the case of initial symbolic distributional variables
can be found in Irpino et al. (2017).
3 Comparing and improving the clustering methods by using
quality criteria of the Symbolic Data Table (SDT) that they
induce.


How to measure the explanatory power of an SDT?

The explanatory power of a clustering method can be measured from the explanatory
power of the SDT that it induces. Basically, the more the rows of an SDT
differ, the higher its explanatory power. Many other kinds of quality criteria of
an SDT can be used, such as the entropy of the symbolic values and the correlation between
the symbolic variables (see for example Diday (2013, 2016)). A general way is to
use the pairwise dissimilarities between the symbolic descriptions of the classes.
These dissimilarities can be adapted to each kind of variable (Wasserstein between
quantile functions, Hausdorff between intervals, etc.; see Diday, Noirhomme (2008)
for other dissimilarities between symbolic data). In that way we can associate with any
SDT a set of pairwise dissimilarity values between its rows. This set of numbers,
which lie between 0 and 1, is denoted D.
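As a small illustration, the set D can be computed for the interval-valued WEIGHT column of Table 1 using the Hausdorff distance between intervals. Rescaling by the largest distance, so that the values lie between 0 and 1, is an assumption of this sketch; the text does not specify the normalisation.

```python
from itertools import combinations

def hausdorff(a, b):
    """Hausdorff distance between two closed intervals a = (a_min, a_max), b = (b_min, b_max)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

# WEIGHT intervals of the four teams, taken from Table 1.
weight_intervals = {
    "BARSA": (75, 89),
    "MANCHESTER": (80, 95),
    "PARIS-ST G.": (76, 95),
    "DORTMUND": (70, 85),
}

raw = {(r, s): hausdorff(weight_intervals[r], weight_intervals[s])
       for r, s in combinations(weight_intervals, 2)}
largest = max(raw.values())
D = {pair: d / largest for pair, d in raw.items()}   # rescaled to [0, 1] (an assumption)

for pair, d in D.items():
    print(pair, d)
```

Summaries of D, such as its histogram or its inter-quartile interval, can then serve as the symbolic description of the clustering method that produced the table.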

In this way, we can describe each clustering method by a set of symbolic variables
(induced from D), whose values can be a probability density, a quantile function, a
histogram, an inter-quartile interval, a set of percentile values, etc. Finally, we get an
SDT where each row is associated with a clustering method (and its induced SDT) and
each column is associated with a symbolic variable induced by its set D. We can then
compare the efficiency of the different clustering methods on a given ground data
table, in terms of the explanatory power or other qualities of the obtained SDT.
Several SDA methods can be used, for example a visualization allowing a
positioning of the different clustering methods by an extension of PCA to symbolic
data directly (Diday 2013) or by using a Wasserstein distance (Irpino, Verde 2016).

How to improve the explanatory power of clustering methods by using the SDT?

Clusters often overlap, and it is difficult to say to which cluster an individual lying
at the bridge between two clusters should be allocated. We can improve the
explanatory power (or other quality criteria) of the SDT associated with the obtained
clusters by iteratively reallocating the individuals, improving at each step the
explanatory power of their associated SDT, until convergence. More precisely, taking a
partition as starting point, each individual can be allocated to the class whose
symbolic description improves the chosen quality criterion of the symbolic data table.
The new classes can be described by new symbolic data; we can then reallocate each
individual, and so on until convergence.


4 Conclusion and perspectives 


In practice, most given data sets contain a class variable from which an SDT can be
built and analyzed by SDA. In this paper, we are interested in the case where the
classes are not given. Therefore, several kinds of clustering methods have been
recalled or introduced, and we have shown how they lead to an SDT which enhances their
interpretation, allows their comparison and improves them by means of several quality criteria
(such as their explanatory power). This paper opens several directions of research which
need to be studied in depth and tested experimentally.
Identifying Meta Communities on Large
Networks


L’identificazione di meta comunità in reti di grandi 
dimensioni 


Carlo Drago 


Abstract On large networks it is often necessary to consider specific patterns
related to structured groups of nodes, which can be defined as communities. We
propose an approach to cluster the different communities using interval data. This
approach is relevant for the analysis of large networks, and in particular for
discovering the different functionalities of the communities inside a network.

Abstract In networks di larghe dimensioni esiste la necessita di considerare strut- 
ture dati relativi a gruppi di nodi definite comunità. In questo lavoro proponiamo un 
approccio di clustering su rappresentazioni delle communità basate su dati ad inter- 
vallo. Questo approccio e’ rilevante nel contesto di networks di larghe dimensioni 
al fine di identificare le diverse funzionalità nella rete stessa. 


Key words: Social Network Analysis, Community Detection, Symbolic Data,
Clustering

There are cases in which it is important to cluster the communities of a network.
An important example is given by Fortunato 2010 [6]: different communities are
associated with different behaviors on the network. The nodes of a community can be
associated with a specific function or behavior inside the network. In order to predict
the future behavior of the nodes, it can be crucial to determine the different communities
and to understand the patterns of similarity that can be observed. In this sense,
understanding the concept of community on a network leads to a better understanding of the
network behavior as a whole (see Coscia et al. 2011 [3]). So it is necessary to
represent the community adequately in order to understand the entire network. The problem of
adequate representation is particularly relevant with big data, and we
have to decide on the best representation to use. Clustering communities means taking
into account, in the clustering process, the structural features of the communities
and the attributes of their nodes, considered as a whole, since these are what characterize
the community. In order to cluster communities it is necessary to represent adequately
the community structure of a network (Drago 2015
[5]). In particular it is relevant to find the nodes that show similar characteristics
to each other and are only loosely connected with nodes belonging to other
communities (see Girvan and Newman 2002 [8]). At the same time it is particularly
important to propose an approach based on interval data, because
we want to consider the entire community (on the use of symbolic data in network
analysis see Giordano and Brito 2014 [7]). Communities are a very relevant object
to consider. In fact, on a specific network, the vertices of a community tend to react as a
whole, and so it can be relevant to cluster them as a whole.


1 Community Identification 


The first step of the analysis is to determine the communities inside the network.
To do so we need an appropriate community detection algorithm (Zhao et al. 2011 [14]
and Blondel et al. 2008 [2]). We start from a network G defined as:


$$G = (V, E) \qquad (1)$$


Typically, community detection methods focus on the connections among the nodes that
are part of the same community, although there are cases in which some nodes do not
fit any of the communities identified. The general assumption of these methods is
that the emphasis is on the ties inside the communities, rather than on the ties
which connect members of different communities (Zhao et al. 2011 [14]). Usually the
relevant requirement for detecting a community is connectedness: we expect a strong
connection among the nodes that are part of the community. For detecting communities
we need to take into account the modularity, which measures explicitly the capacity
of a network to be divided into different modules (Blondel et al. 2008 [2]) and
thereby allows the identification of the communities. The modularity (see Newman
2006 [10]) is computed with respect to a null model that has no community structure
(i.e. a random graph). Following Fortunato 2010 [6], we can define the modularity in
this way:


$$Q = \frac{1}{2m} \sum_{ij} \left(A_{ij} - K_{ij}\right) \gamma(C_i, C_j) \qquad (2)$$


where $m$ is the number of edges of the network, $A$ is the adjacency matrix and
$K_{ij}$ is the expected number of edges between the vertices $i$ and $j$ in the null
model. Finally, the function $\gamma$ returns two possible values: 1 if the two
vertices $i$ and $j$ belong to the same community and 0 if they belong to different
communities. However, in order to also take into account the degrees of the vertices
$i$ and $j$ (it is important to consider the degree distribution), we can write the
modularity as follows (Fortunato 2010 [6] and Newman 2006 [10]):


$$Q = \frac{1}{2m} \sum_{ij} \left(A_{ij} - \frac{k_i k_j}{2m}\right) \gamma(C_i, C_j) \qquad (3)$$


where $k_i$ and $k_j$ are the degrees of the vertices $i$ and $j$. In this way we
obtain the communities of the network. A different approach is the one followed by
Reichardt and Bornholdt [12], which introduces a more general treatment of null
models (see also Newman 2006 [11]). These communities are important because they are
a stylized way to represent the structure of the network. We need to take into
account the entire groups of nodes in order to cluster the communities as a whole,
and to do so we consider all the nodes individually through their statistical
characteristics. Starting from the communities, we can cluster them by building an
adequate data matrix. In particular, for each community we have to consider the
measurements needed to build the different intervals.
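
The modularity in (3) can be illustrated with a short Python sketch. It is not the
authors' code (the paper works with the R package igraph); the function name
`modularity` and the toy graph are assumptions made for the example, and the graph is
assumed to be undirected, unweighted and without self-loops. The sketch uses the
community-wise form of the same sum, which avoids looping over all pairs of vertices.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = 0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        m += 1
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1

    # Total degree of the nodes belonging to each community
    degree_sum = defaultdict(int)
    for node, k in degree.items():
        degree_sum[community[node]] += k

    # Q = sum_c [ L_c / m - (D_c / 2m)^2 ], which is algebraically equal to
    # (1 / 2m) * sum_ij (A_ij - k_i k_j / 2m) * gamma(C_i, C_j)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (invented for illustration): two triangles joined by one edge
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(toy_edges, toy_partition))
```

Placing all six nodes in a single community gives Q = 0, which reflects the fact that
the measure is defined relative to the null model.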


2 K-Means Clustering of the Communities 


Each community is characterized by a vector with the n observations related to its
vertices for the same variable or attribute. So we can have:


Pai (4) 


We can write the interval data (the measurement for the network community) in 
this way: 


$$x = [\underline{x}, \overline{x}] \qquad (5)$$


Each interval of the considered variable represents a measurement for a single
community. In this way we keep, for each community, the measurements relating to the
single vertices, and at the same time we obtain observations related to the
communities themselves (represented by the intervals). It is important to note that
the intervals are characterized by their radii and their midpoints, so each community
can also be represented by a midpoint value. In particular, it is possible to obtain
the interval midpoint for the generic variable b:


$$x^{b}_{\mathrm{center}} = \frac{1}{2}\left(\underline{x}^{b} + \overline{x}^{b}\right) \qquad (6)$$

And the radius of the interval:

$$x^{b}_{\mathrm{radius}} = \frac{1}{2}\left(\overline{x}^{b} - \underline{x}^{b}\right) \qquad (7)$$

In order to cluster the different communities we start from the measurement of the
community characteristics by using interval data. Many different clustering
algorithms have been proposed in order to cluster interval data. In particular
interval clustering was considered first.
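
As an illustration of the interval representation, the following Python sketch builds
one interval per community from node-level values and computes the midpoint and radius
of (6) and (7). The final step is an assumption: the text does not specify which
interval clustering algorithm is used, so the sketch simply runs plain k-means on the
(midpoint, radius) pairs. All function names and the toy degrees are hypothetical.

```python
def community_intervals(values, community):
    """Return {community: (lower, upper)} from a {node: value} mapping."""
    intervals = {}
    for node, x in values.items():
        c = community[node]
        lower, upper = intervals.get(c, (x, x))
        intervals[c] = (min(lower, x), max(upper, x))
    return intervals


def midpoint_radius(interval):
    """Midpoint and radius of an interval [lower, upper]."""
    lower, upper = interval
    return (lower + upper) / 2, (upper - lower) / 2


def kmeans(points, centers, n_iter=20):
    """Plain Lloyd iterations; `centers` are user-supplied starting points."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assign every point to the nearest center (squared Euclidean distance)
        labels = [min(range(len(centers)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(p, centers[k])))
                  for p in points]
        # Move every center to the mean of its members
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers


# Toy node degrees and community memberships (invented for illustration)
degree = {"a": 1, "b": 3, "c": 2, "d": 8, "e": 10, "f": 2, "g": 4}
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 2, "g": 2}

intervals = community_intervals(degree, community)
points = [midpoint_radius(intervals[c]) for c in sorted(intervals)]
labels, centers = kmeans(points, centers=[points[0], points[1]])
print(intervals, labels)
```

With several structural measures, each community would contribute one (midpoint,
radius) pair per variable, and the same function could be applied to the longer
vectors.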


3 Simulation Study 


We start by considering different networks simulated with different characteristics.
The networks are obtained by using the R package igraph (see Csardi and Nepusz 2006
[4]). In order to perform the data analysis at community level we have also used the
package RSDA (Rodriguez 2014 [13]). We consider different groups of networks (for
example, Barabasi-Albert graph models; Barabasi and Albert 1999 [1]).

We have considered many different networks in order to test the approach on
different structures. Here we present some examples obtained from the simulation
study. The results obtained on the examples are important in order to derive some
interpretation rules for the results of the proposed approach. In the first case we
consider a network generated by the Barabasi game, with 100 nodes. We obtain 6
communities as part of the community structure.
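
For readers without R at hand, the growth mechanism behind a Barabasi-Albert network
can be sketched in a few lines of Python. This is only an illustration under
simplifying assumptions (one edge per new node, attachment probability proportional
to degree, undirected edges); the generator in igraph exposes further parameters and
will not produce the same graph. The function name `barabasi_albert` is hypothetical.

```python
import random


def barabasi_albert(n, seed=0):
    """Grow a graph by preferential attachment, one new edge per new node."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in `targets` once per incident edge, so a uniform
    # draw from this list selects a node with probability proportional
    # to its current degree.
    targets = [0, 1]
    for new in range(2, n):
        old = rng.choice(targets)
        edges.append((old, new))
        targets.extend([old, new])
    return edges


network = barabasi_albert(100, seed=1)
print(len(network))  # number of edges in the simulated network
```

A community detection step would then be run on the resulting edge list before
building the interval data for each community.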


Fig. 1 Barabasi Model Simulation using 100 nodes: Community Structure 


From the clustering analysis we can observe that there is a group or cluster of
communities at the center which shows similar structural characteristics. This can
also be observed by considering the different interval scatterplot diagrams. That
means we are able to identify some groups of communities which tend to show similar
characteristics.


Fig. 2 Barabasi Model Simulation using 100 nodes: Community Structure visualized. On the
x-axis the degree, on the y-axis the eigenvector centrality scores


Then we consider the 9 different communities obtained in the clustering process
using K-Means. In this case we consider different specifications of the data matrix,
built from different relevant structural measures, in order to evaluate how the
results change. Two interpretative results need to be noted. First, in the
simulations the lower bound of the intervals does not seem very relevant, whereas the
upper bound is relevant for discriminating between the communities: the differences
observed between the intervals are determined mainly by the differences in the upper
bounds. Second, the visualization shows an overlapped structure because the network
has a centralized structure, which tends to cluster the communities in a central
position. In this case the betweenness is related to the higher degree. It is also
possible to note that the nodes can be assigned to different groups of similar nodes
when they are considered within their communities. In fact, considering the nodes as
part of a community is different from considering the nodes individually.


4 Conclusions 


The results we have obtained confirm the usefulness of the approach on large
networks. In particular, the approach is useful for determining the community
structure and the meta-communities which can be identified on a specific network.
Starting from the meta-communities (following the definition of Kalinka and Tomancak
2011 [9]) it is possible to obtain the different prototypes. A relevant insight is
that clustering communities as groups of nodes can give different results from
clustering the single nodes. In this sense the analysis is enriched by the fact that
in some cases the nodes show relevant dissimilarities within their communities, which
need not be taken into account when the analysis is performed by considering the
communities as a whole. At the same time, clusters of communities (or
meta-communities) can be characterized by the behavior of nodes which participate in
the community at different levels. The approach considered in this work allows the
exploration of these levels of interaction between the different nodes.


References 


1. Barabasi, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science,
286(5439), 509-512.

2. Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of
communities in large networks. Journal of Statistical Mechanics: Theory and Experiment,
2008(10), P10008.

3. Coscia, M., Giannotti, F., & Pedreschi, D. (2011). A classification for community discovery
methods in complex networks. Statistical Analysis and Data Mining: The ASA Data Science
Journal, 4(5), 512-546.

4. Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research.
InterJournal, Complex Systems, 1695. http://igraph.org

5. Drago, C. (2015). Exploring the Community Structure of Complex Networks. Annali del
MEMOTEF - Note e Discussioni 10/2015; 2 (forthcoming).

6. Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3), 75-174.

7. Giordano, G., & Brito, P. (2014). Social Networks as Symbolic Data. In Vicari, D., Okada, A.,
Ragozini, G., & Weihs, C. (Eds.), Analysis and Modeling of Complex Data in Behavioral and
Social Science, Springer: Heidelberg, pp. 133-142.

8. Girvan, M., & Newman, M. E. (2002). Community structure in social and biological
networks. Proceedings of the National Academy of Sciences, 99(12), 7821-7826.

9. Kalinka, A. T., & Tomancak, P. (2011). linkcomm: an R package for the generation,
visualization, and analysis of link communities in networks of arbitrary size and type.
Bioinformatics, 27(14).

10. Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the
National Academy of Sciences, 103(23), 8577-8582.

11. Newman, M. E. (2006). Finding community structure in networks using the eigenvectors of
matrices. Physical Review E, 74(3), 036104.

12. Reichardt, J., & Bornholdt, S. (2006). Statistical mechanics of community detection. Physical
Review E, 74(1), 016110.

13. Rodriguez, O. R., with contributions from Olger Calderon and Roberto Zuniga (2014).
RSDA: RSDA - R to Symbolic Data Analysis. R package version 1.2.
http://CRAN.R-project.org/package=RSDA

14. Zhao, Y., Levina, E., & Zhu, J. (2011). Community extraction for social networks.
Proceedings of the National Academy of Sciences, 108(18), 7321-7326.


Random Forest-Based Approach for 
Physiological Functional Variable Selection for 
Driver’s Stress Level Classification 


Neska El Haouij, Jean-Michel Poggi, Raja Ghozi, Sylvie Sevestre Ghalila, and 
Mériem Jaidane 


Abstract With increasing urbanization and technological advances, urban driving is
bound to be a complex task that requires higher levels of alertness. Thus, the
driver’s mental workload should be optimal in order to manage critical situations in
such challenging driving conditions. Past studies relied on drivers’ performance and
used subjective measures. The new wearable and non-intrusive sensor technology not
only provides real-time physiological monitoring, but also enriches the tools for
monitoring human affective and cognitive states. This study focuses on a driver’s
physiological changes, measured using portable sensors on different urban routes.
Specifically, the Electrodermal Activity (EDA) measured at two different locations
(hand and foot), the Electromyogram (EMG), the Heart Rate (HR) and the Respiration
(RESP) of ten driving experiments on three types of routes (rest area, city and
highway driving) are considered. The data are taken from the physiological database
drivedb, available online on the PhysioNet website. Several studies have been done on
driver’s stress level recognition using physiological signals. Classically,
researchers extract expert-based features from physiological signals and select the
most relevant features for stress level recognition. This work aims to apply a random
forest-based method for the selection of physiological functional variables in order
to classify the stress level during real-world driving. The contribution of this
study is twofold: on the methodological side, it considers physiological signals as
functional variables and adapts a procedure of data processing and variable
selection. On the applied side, the proposed method provides a "blind" procedure of
driver’s stress level classification that does not depend on expert-based studies of
physiological signals.


Key words: Random Forests, Variable Selection, Functional Data, Physiological 
Signals 


1 Introduction 


This paper aims to provide a random forests-based method for the selection of
physiological functional variables in order to classify the stress level experienced
during real-world driving. To this end, we first present the context of our work,
which concerns affective computing, with a summary of the study introducing the
physiological database drivedb. Then, methods for functional data, variable selection
using random forests and grouped variable importance are addressed. The contribution
of this study is twofold: on the methodological side, it adapts the scheme proposed
by [6] to take advantage of the functional nature of the physiological data and
offers a procedure of data processing and variable selection. On the applied side,
the proposed method provides a blind (i.e. without prior information) procedure of
driver’s stress level classification that does not depend on the extraction of
expert-based features of physiological signals. This allows automatic exploration of
promising signals to be included in statistical models for driver’s state
recognition.


2 Stress level recognition while driving 


Many research groups have tried to provide solutions and tools to vehicle and roadway
users in order to improve safety, efficiency and quality in the transport sector.
[14] points out that, according to the American Highway Traffic Safety
Administration, high stress levels negatively impact drivers’ reactions, especially
in critical situations. Stress is one of the most prominent causes of vehicle
accidents, alongside intoxication, fatigue and aggressive driving. In real-world
driving, human affective state monitoring can offer useful information to avoid
traffic incidents and provide safe and comfortable driving.

With increasing urbanization and technological advances, the new wearable and
non-intrusive sensor technology not only provides real-time physiological
monitoring, but also enriches the tools for monitoring human affective and cognitive
states. In particular, several studies have been reported in recent years in the
field of driver’s stress monitoring. In this paper, we base our analysis on the study
of [10], which presented a protocol of physiological data collection in real-world
driving conditions in order to detect stress levels. Specifically, physiological
signals such as Electrodermal Activity (EDA), Electrocardiogram (ECG), Electromyogram
(EMG) and Respiration (RESP) were captured for 24 driving experiences.

Features were derived from non-overlapping segments of the physiological signals
taken from the rest, highway and city periods of the driving experiences. The first
analysis, aiming to classify the stress levels, distinguishes between the three
levels of driver stress with an accuracy of 97%. The second analysis concerns the
correlation between features extracted from the physiological signals and a stress
level metric created from the video tape. In this study, [10] reported a correlation
between the driver’s affective state, quantified by the stress level metric, and the
physiological signals; the highest correlations are with the EDA and HR. They have
partially released their physiological database, labeled "drivedb", online on the
PhysioNet website¹. The data used in our work were extracted from the drivedb
database, which has a clear annotation of the driving periods for each experience,
allowing an easy exploitation of the information. Apart from its online availability,
various studies have been based on this database, which constitutes a main reference
on stress level recognition in highway and city driving.


3 Functional Variable Selection 


The main issue of variable selection methods is their instability: the set of
selected variables may change when the training sample is perturbed. The most widely
used remedy consists in using bootstrap samples, where a stable solution is obtained
by aggregating the selections achieved on several bootstrap subsets of the training
data. The random forests algorithm, introduced by [1], is one of these methods; it is
based on aggregating a large collection of tree-based estimators. These methods have
good predictive performance in practice and they work well for high dimensional
problems. Their power is shown in several studies summarized in [15]. Moreover,
random forests provide several measures of the importance of the variables with
respect to the prediction of the outcome variable. It has been shown that the
permutation importance measure introduced by Breiman is an efficient tool for
selecting variables ([2, 5, 7]).

The standard approach in Functional Data Analysis (FDA) (see for example [13, 3])
consists in projecting the functional variables onto a space spanned by a functional
basis such as splines, wavelets or Fourier bases. Several regression and
classification methods have been studied in two situations: with one functional
predictor and, more recently, with several functional variables.


1 http://physionet.org/

Classification based on several, possibly functional, variables has also been
considered for similar driving experiences, using the CART algorithm in the study of
[12] and using SVM in [16]. Variable selection using random forests was achieved in
the study of [4]. In our study, multiple FDA using random forests and the grouped
variable importance measure proposed by [6] are used.


3.1 Variable Selection using Random Forest-based Recursive 
Feature Elimination 


In this study, Random Forests-based Recursive Feature Elimination (RF-RFE) is used.
The RF-RFE algorithm, proposed by [6], was inspired by [9], which introduced the
Recursive Feature Elimination algorithm for SVM (SVM-RFE). At the first step, the
dataset is randomly split into a training set containing two thirds of the data and a
validation set containing the remaining third. The procedure fits the model to all
explanatory variables using Random Forests. Then, the variables are ranked using
their importance measure. The grouped variable importance (VI) is computed only on
the training set. The least important predictor is eliminated, the model is refit and
the performance is assessed by a prediction error computed on the validation set. The
variable ranking and elimination is repeated until no variable remains. The final
model is chosen by minimizing the prediction error. It should be noted that at each
iteration the predictors’ importance is recomputed on the model built from the
reduced set of explanatory variables.
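
The elimination loop described above can be sketched in Python as follows. This is a
schematic outline only, not the RF-RFE implementation of [6]: the two callables
`rank_importance` and `validation_error` are hypothetical placeholders standing for a
random forest fitted on the training set and evaluated on the validation set, and the
toy numbers at the end are invented purely to exercise the loop.

```python
def recursive_elimination(variables, rank_importance, validation_error):
    """Backward elimination driven by an importance ranking.

    rank_importance(subset) -> {variable: importance}, from the training data.
    validation_error(subset) -> float, from the validation data.
    Returns the subset with the smallest validation error and that error.
    """
    current = list(variables)
    best_subset, best_error = None, float("inf")
    while current:
        error = validation_error(current)
        if error < best_error:
            best_subset, best_error = list(current), error
        # Importance is recomputed on the reduced set at every iteration
        importance = rank_importance(current)
        weakest = min(current, key=lambda v: importance[v])
        current.remove(weakest)
    return best_subset, best_error


# Toy stand-ins (invented values, not results from the paper)
toy_importance = {"x1": 0.9, "x2": 0.7, "x3": 0.2, "x4": 0.1}
signal, noise = {"x1", "x2"}, {"x3", "x4"}


def toy_rank(subset):
    return {v: toy_importance[v] for v in subset}


def toy_error(subset):
    # Penalize noise variables that are kept and signal variables that are lost
    return (0.1 * sum(v in noise for v in subset)
            + 0.5 * sum(v not in subset for v in signal))


selected, err = recursive_elimination(["x1", "x2", "x3", "x4"], toy_rank, toy_error)
print(selected, err)
```

For grouped variables, each element of `variables` would stand for a whole group (for
example all wavelet coefficients of one signal) and the importance would be the
grouped measure.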

In the case of functional variables, the selection is performed using the algorithm
on two different types of groups, thanks to the definition of the importance of
groups of variables. This makes it possible to consider a group of variables as a
whole, for example the group of the wavelet coefficients of a given signal, and to
quantify its relative importance with respect to the other functional variables.


3.2 Our procedure: Variable selection using iterative RF-RFE 


The approach proposed in this work aims first to eliminate the physiological
variables that are irrelevant to the stress level classification task, and then to
select, within each retained variable, the most relevant wavelet levels. In this
study the number of variables (20480) is very large compared to the number of
observations (68), so the procedure is not stable. In order to reduce the variability
of the selection, the procedure is repeated 10 times.


3.3 Variable selection results


The objective of variable selection is first to eliminate the physiological signals
that do not contribute significantly to the stress level classification; then, for
the retained physiological variables, the most relevant wavelet levels are selected.

When applying our procedure to the drivedb database, we first decompose the
functional variables using the Haar wavelet, which is considered the simplest one. We
pick 12 as the decomposition level, which corresponds to the maximum level compatible
with the $4096 = 2^{12}$ samples.
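
To make the link between signal length and decomposition level concrete, here is a
minimal Python sketch of a full Haar decomposition. It is not the code used in the
study (which relies on R); the function name `haar_decompose`, the orthonormal
scaling and the sign convention of the detail coefficients are assumptions. A signal
of 4096 samples yields 12 detail levels plus one approximation coefficient, 4096
coefficients in total; five signals of this length would give 5 x 4096 = 20480
coefficients, which is consistent with the number of variables reported in the paper.

```python
import math


def haar_decompose(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns (approximation, details); details[0] is the finest level.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    s = math.sqrt(2.0)  # orthonormal scaling keeps the energy unchanged
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / s for a, b in pairs])
        approx = [(a + b) / s for a, b in pairs]
    return approx, details


# Synthetic signal (invented) with the same length as in the study
x = [math.sin(0.01 * i) for i in range(4096)]
approx, details = haar_decompose(x)
print(len(details), [len(d) for d in details])
```

Each entry of `details` can then be treated as one group of variables when the
importance of wavelet levels is assessed.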

To carry out this work, we use the R software, with the randomForest package
proposed by [11] and the RFgroove package developed by [8].

The proposed “blind” approach performs as well as the expert-based approach in terms
of misclassification rate. Moreover, this procedure offers additional information,
such as the ranking of the physiological variables according to their importance and
the list of the variables relevant to stress level classification. The obtained
results suggest that the EMG and the HR are not very relevant when compared to the
EDA and the respiration signals. This may help to determine the list of physiological
sensors that can be proposed to smart vehicle designers in order to assess the stress
level.


Acknowledgements Jean-Michel Poggi, and all the authors, thank the organizers for the invitation


A risk index to evaluate the criticality of a
product defectiveness


A risk index for ordinal variables:
an application in quality control


Silvia Facchinetti and Silvia A. Osmetti 


Abstract We propose a risk index that is naturally suited to the quality-control
framework, where data are often collected on an ordinal scale, to measure the risk of
failure of a product. A so-called Severity Index is defined on the basis of the
relative frequencies of the ordinal variables. We examine the distribution and the
statistical properties of its estimator. We apply the index to real data concerning
the severity and the occurrence of defectiveness of the products of a multinational
manufacturing corporation. Our index may be employed to communicate the level of
risk, to compare different risks and to identify interventions in the production
system.
Abstract In this work we propose a synthetic risk measure for ordinal variables
based on relative frequencies. The index is particularly suitable in quality control
for assessing the risk of defectiveness of a product, since in this setting the
available information is often ordinal in nature. We analyse the characteristics of
the index, propose an unbiased and consistent estimator of it, and study its
asymptotic distribution. Our proposal is applied to data from a multinational
company concerning the severity of the different types of defects detected on the
components of an assembled product. The computed index makes it possible to define
the level of risk of the components, which is useful for planning appropriate
interventions on the production system.


Key words: risk index; categorical variables; failure modes and effects analysis.


1 Introduction


One of the ways to measure and communicate the level of risk is through risk
indices, which are based on a synthesis of quantitative or qualitative information.
For a discussion on risk measures see e.g. [5].

To calculate the risk, companies often employ approaches based on categorical data
expressed on an ordinal scale that are improperly treated as quantitative. In this
context we discuss a synthetic measure of risk suitable for data expressed on an
ordinal scale, called the Severity Index (S). It is defined on the basis of the
relative frequencies of the ordinal variables. We study the distribution and the
statistical properties of its estimator.

This index appears naturally suitable for measuring the risk of failure of a product
in a quality-control framework, in the testing and recall phases of products or in
similar situations where the quality is expressed on an ordinal scale. More
precisely, our aim is to propose a synthetic indicator of the priority of
intervention for a product, based on the frequencies of specific ordinal variables
used to measure the quality and the reliability of that product.

Other authors have proposed measures of risk for ordinal data. Figini and Giudici
[2] propose a non-parametric measure of operational risk for ordinal variables in a
Bayesian framework. Figini et al. [3] use optimal scaling techniques to reduce the
dimensionality of ordinal variables describing the service quality to a continuous
score interpretable as a measure of operational risk. Cerchiello et al. [1] propose
rank-based models to assess the perceived quality of academic teaching.


2 The Severity Index 


Let $X \sim \{x_j, p_j\}$ for $j = 1, 2, \ldots, K$ be a categorical random variable
(r.v.) representing the level of quality/defectiveness of a product, with ordered
categories $x_j$ and probabilities $p_j = P(x_j)$. We denote with
$\mathcal{P}_{K-1} = \{(p_1, p_2, \ldots, p_j, \ldots, p_{K-1}) : \sum_{j=1}^{K-1} p_j \le 1\}$
the parametric space of $X$. Let $U \sim \{u_j = j, p_j\}$ for $j = 1, 2, \ldots, K$
be a discrete stochastic variable corresponding to $X$ with parametric space
$\mathcal{P}_{K-1}$, whose expected value


$$\mu_U = \sum_{j=1}^{K} j\, p_j = K - \sum_{j=1}^{K-1} (K - j)\, p_j \qquad (1)$$


is usually adopted as a measure of risk.

If $X$ represents growing levels of faultiness, then we are dealing with losses.
Therefore $\mu_U$ may be considered as a naive indicator of the defectiveness of a
product.
Sometimes it may be necessary to swap from one approach to the opposite; in this
case the expected value of the reverse discrete r.v.
$U^* \sim \{u^*_j = (K+1) - u_j, p_j\}$ for $j = 1, 2, \ldots, K$ should be adopted.
We call this expected value the Severity Index ($S$) of the categorical r.v. $X$:


$$S = (K+1) - \mu_U = 1 + \sum_{j=1}^{K-1} (K - j)\, p_j = \sum_{j=1}^{K} F_j \qquad (2)$$


where $F_j = \sum_{l=1}^{j} p_l$ are the values of the cumulative distribution
function of $U$ for $j = 1, 2, \ldots, K$. $S$ is based only on the cumulative
probabilities of the ordinal variable $X$, and is expressed as a function of $K$ and
the parametric space $\mathcal{P}_{K-1}$. It assumes values in $[1, K]$; the lower
and the upper bounds occur in the two situations of minimum heterogeneity.

The Severity Index proposed in (2) can be estimated by its empirical counterpart,
using the empirical cumulative distribution function of $X$. Let
$(x_1, x_2, \ldots, x_n)$ be a simple random sample of size $n$ from the categorical
variable $X$, and let $(u_1, u_2, \ldots, u_n)$ be the corresponding sample of
discrete values from the stochastic variable $U$.

The Severity Index estimator is defined as follows:


$$\hat{S} = \sum_{j=1}^{K} \hat{F}_j = 1 + \sum_{j=1}^{K-1} (K - j)\, \frac{r_j}{n} \qquad (3)$$


where $\hat{F}_j = \sum_{l=1}^{j} \frac{r_l}{n}$, for $j = 1, 2, \ldots, K$, is the
empirical cumulative distribution function and $r_j$ is the number of observations in
the sample equal to the category $x_j$, with $r_j \in \mathbb{N}$ and
$\sum_{j=1}^{K} r_j = n$.

It is possible to show that the exact distribution of $\hat{S}$ depends on the
unknown values $p_1, p_2, \ldots, p_K$. Consequently, in order to perform inferential
procedures on $S$ that are robust with respect to the choice of
$p_1, p_2, \ldots, p_K$, it is possible to demonstrate that the Severity Index
estimator is asymptotically normally distributed. Moreover, $\hat{S}$ is an unbiased
and consistent estimator of $S$.


3 Application in quality-control framework 


We apply the proposed index to real data from a sales company of a multinational
corporation that manufactures motion and control technology and systems, providing
precision-engineering solutions for the mobile, industrial and aerospace markets. The
data concern the severity and occurrence observed for potential failures of three
components of a hose assembly (stripes, guard, hose). Severity is a measure of the
gravity of a particular type of defect and occurrence is the frequency of the defects
on n products. This information is typically available in companies that apply FMEA
(failure modes and effects analysis)¹ to identify potential failures that could
affect the customer’s expectations of product quality or process performance.


1 FMEA is a reliability tool of product or process analysis that is conducted to identify potential
failures that could affect the customer’s expectations of product quality or process performance.

For recent discussions and studies on the risk measures in FMEA see, among others,
[7, 4, 6].

For each component of the hose assembly, the operator observes the type of defect
and its frequency, and then classifies every defect by its level of severity on a
3-point scale (H=High severity, M=Medium severity, L=Low severity²). The observed
levels of severity and the corresponding frequencies are summarized in Table 1. We
call the ordinal variable in Table 1 SEVERITY.


Table 1 Frequencies for levels of Severity for the components of the hose assembly


SEVERITY   stripes   guard   hose

H          0         0       0.58501
M          0.34286   0       0.29971
L          0.65714   1       0.11527


Our aim is to measure the risk associated with each component by an index that
summarizes the SEVERITY. We calculate the sample Severity Index $\hat{S}$, according
to (3), and its normalized version $\hat{I} \in [0, 1]$:


$$\hat{I} = \frac{\hat{S} - 1}{K - 1}, \qquad (4)$$


and we provide the asymptotic confidence intervals for $\hat{I}$. The results are
reported in Table 2.


Table 2 Point and 95% interval estimation for $\hat{S}$ and $\hat{I}$


INDEX       stripes              guard   hose

$\hat{S}$   1.34286              1       2.46974
            (1.18560; 1.50011)   -       (2.43331; 2.50618)

$\hat{I}$   0.17143              0       0.73487
            (0.01417; 0.32869)   -       (0.69844; 0.77131)
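
The point estimates in Table 2 can be recomputed from the relative frequencies in
Table 1 with a few lines of Python. This is an illustrative sketch, not the authors'
code: the function names are hypothetical, and the category order H, M, L is an
assumption, chosen because it is the order under which the sum of cumulative
frequencies in (3) reproduces the tabulated values. Because the frequencies in Table
1 are rounded, agreement with Table 2 is to about four decimal places. The confidence
intervals are not recomputed, since the sample sizes are not reported.

```python
from itertools import accumulate


def severity_index(freq):
    """Sum of the cumulative relative frequencies F_1, ..., F_K."""
    return sum(accumulate(freq))


def normalized_index(freq):
    """Rescale the Severity Index from [1, K] to [0, 1]."""
    k = len(freq)
    return (severity_index(freq) - 1) / (k - 1)


# Relative frequencies from Table 1, in the assumed order H, M, L
table1 = {
    "stripes": [0.0, 0.34286, 0.65714],
    "guard": [0.0, 0.0, 1.0],
    "hose": [0.58501, 0.29971, 0.11527],
}

for component, freq in table1.items():
    print(component,
          round(severity_index(freq), 4),
          round(normalized_index(freq), 4))
```

With all the mass on the last category, as for "guard", the index takes its minimum
value of 1 and the normalized index is 0.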


We observe that "hose" is the component with the highest level of risk. The other
components have a low level of risk. For "guard" a situation of minimum
heterogeneity occurs: the index assumes its minimum value and
$\mathrm{Var}(\hat{S}) = 0$.

These results may be very useful for the company to prioritize interventions on the
business line of the hose assembly, also in terms of improving the related process
control.

Summarizing, the proposed index has been employed to communicate the level of risk
and to compare different risks in order to identify interventions on the production
system or to support decision making. Moreover, the index could be employed to
understand how the risk changes by monitoring it over time.


2 "H" indicates that the gravity of the defect is very serious and "L" that it is not significant.


Exponential family graphical models and
penalizations


Graphical models based on exponential families and related
penalizations


Federico Ferraccioli, Livio Finos 


Abstract In this paper we focus on the semantics of undirected graphical models. We
present a general specification based on exponential family distributions that allows
great model flexibility and leads to consistent inferential procedures. The model is
extended to include prior distributions on the parameters, which reduce the variance
of the estimates and help to avoid over-parametrization. Particular attention is
devoted to the non-differentiable $\ell_1$ penalization, which leads to a non-explicit
gradient and for which we propose a new differentiable approximation. Experimental
results and applications to large scale data are provided to demonstrate the increase
in the rate of convergence and the variance reduction for different types of
penalization priors.


Abstract Questo lavoro si concentra su modelli probabilistici basati su grafici indi- 
retti. Viene presentata una specificazione generale, basata su distribuzioni apparte- 
nenti alla famiglia esponenziale, che permette grande flessibilità e conduce a stime 
consistenti. Il modello è dunque esteso introducendo vari tipi di distribuzioni a 
priori sui parametri, allo scopo di ridurre sia la varianza delle stime sia il rischio 
di sovra-parametrizzazione. Particolare attenzione è dedicata alla penalizzazione 
$\ell_1$, per la quale viene proposta una nuova approssimazione differenziabile. Infine
vengono presentati risultati e applicazioni a dati di larga scala, con l’intento di di- 
mostrare l’aumento del tasso di convergenza e la riduzione della varianza delle stime 
per i vari tipi di penalizzazioni. 


Key words: graphical models, exponential family, non-differentiable penalization, 
contrastive divergence, hierarchical priors, topic modeling
1 Introduction


Probabilistic graphical models are becoming a key part of statistical modelling, and
are particularly useful for dealing with latent variables. The two major categories are
Bayesian networks and Markov random fields, characterized by directed and undirected
graphs, respectively. In this paper we focus on the latter class of models, defined by
undirected dependencies between observed and latent variables. From this perspective
the model can be seen as a tool for dimensionality reduction, factor analysis
and clustering. Its major advantage consists in the specification of the distributions
of both observed and latent variables as members of the exponential family [4]. Its
most important disadvantage is the intractability of the partition function of the
complete model, which complicates likelihood-based inference. Here we focus
our attention on a statistical specification of a method proposed by Hinton [2],
Contrastive Divergence, which greatly improves the efficiency of inference and opens
the way to large scale problems. The properties of this method are still under study,
but its convergence in the case of exponential family distributions has been
demonstrated in [5]. Using the properties of this family of probability distributions,
we extend the general model by introducing priors on the parameters: this extension
permits the use of prior information about the data and reduces the risk of
over-parametrization, a major issue in large scale problems. Moreover, it reduces the
variance of the estimates and increases the rate of convergence. We concentrate our
discussion on the choice of the prior distribution, studying the problems related to
the optimization procedures for both differentiable and non-differentiable cases. The
optimization in the latter case is not trivial: the loss function is non-convex,
proximal gradient methods are not applicable, and the gradient of a non-differentiable
function is not explicit. Here we give a possible strategy that consists in replacing
the non-differentiable penalization with a differentiable approximation. Experimental
results and applications to large scale data are provided to demonstrate the
increase in the rate of convergence and the variance reduction for different types of
penalization prior. The objective is not to claim superiority of the models proposed
in this paper, but to give a general and viable alternative to directed probabilistic
models, discussing the possible advantages and disadvantages.


2 Exponential Family Model 


Let $X = (X_1, \dots, X_N)$ be a vector of observed random variables and
$Z = (Z_1, \dots, Z_K)$ a vector of latent variables. We can choose from the exponential
family $N$ independent distributions for the observed variables and $K$ independent
distributions for the latent variables. The joint probabilities for the vectors $X$ and
$Z$ are the following:


$$p_X(x \mid \theta) = \prod_{i=1}^{N} p_0(x_i) \exp\left(\theta^T T(x) - A(\theta)\right) \tag{1}$$

$$p_Z(z \mid \lambda) = \prod_{j=1}^{K} q_0(z_j) \exp\left(\lambda^T U(z) - B(\lambda)\right) \tag{2}$$


where $\{T(x), U(z)\}$ are the vectors of sufficient statistics for the models,
$\{\theta, \lambda\}$ are the natural parameters of the models and
$\{A(\theta), B(\lambda)\}$ the log-partition functions.


We are also interested in the dependencies between observed and latent variables,
so we introduce a quadratic interaction term. The joint probability of the complete
model is the following:


$$p(x, z) \propto \exp\left(\theta^T T(x) + \lambda^T U(z) + T(x)^T \Omega\, U(z)\right) \tag{3}$$


3 Estimation procedures with Contrastive Divergence


Parameter estimation is carried out with an efficient method called Contrastive
Divergence, which has the potential to greatly improve the efficiency and reduce the
variance of the estimates needed in the learning rule. The convergence of the method
has also been proved for exponential family models in [5]. The idea is that, instead of
running the Gibbs sampler to its equilibrium distribution, we initialize a Gibbs
sampler on each data vector and run the samplers for only one (or a few) steps in parallel.


In order to obtain the parameter estimates, let $X = (X_1, \dots, X_N)$ be an i.i.d. sample
generated from a certain underlying distribution $p_\theta$.
Maximum likelihood estimation can be done by gradient ascent:


$$\theta^{new} = \theta + \alpha\, g(\theta) = \theta + \alpha \left( \frac{1}{N} \sum_{i=1}^{N} T(x_i) - \mu(\theta) \right) \tag{4}$$


where the learning rate satisfies $\alpha > 0$. The first term
$\frac{1}{N} \sum_{i=1}^{N} T(x_i)$ is easy to compute, but the second term $\mu(\theta)$,
the expectation of $T(X)$ under $p_\theta$, is usually difficult to compute, since it
involves a complicated integral over $X$. To address this problem, Hinton [2] proposed the
Contrastive Divergence (CD) method. The idea of CD is to replace the second term with
$\mu_{CD}(\theta) = \frac{1}{N} \sum_{i=1}^{N} T(x_i^{(m)})$, where $x_i^{(m)}$ is obtained
by a small number ($m$) of steps of an MCMC run starting from the observed sample $x_i$.
We can now derive the parameter updates for our bipartite model as follows:
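
The sketch below does not reproduce the bipartite update. It only illustrates update (4) with the CD replacement of the second term, on a deliberately simple one-parameter exponential family: $p_\theta(x) \propto \exp(\theta x)$ on $\{0, \dots, 5\}$, so that $T(x) = x$. The Metropolis kernel, the toy sample and all function names are assumptions made for the illustration.

```python
import math
import random


def metropolis_step(x, theta, upper, rng):
    """One random-walk Metropolis step targeting p(x) proportional to exp(theta * x)."""
    proposal = x + rng.choice((-1, 1))
    if proposal < 0 or proposal > upper:
        return x  # proposals outside the support {0, ..., upper} are rejected
    accept_prob = min(1.0, math.exp(theta * (proposal - x)))
    return proposal if rng.random() < accept_prob else x


def cd_fit(data, upper, alpha=0.1, m=1, n_iter=500, seed=0):
    """CD-m estimate of theta: theta += alpha * (mean T(x_i) - mean T(x_i^(m)))."""
    rng = random.Random(seed)
    theta = 0.0
    data_mean = sum(data) / len(data)  # first term of (4), with T(x) = x
    for _iteration in range(n_iter):
        total = 0
        for x in data:
            xm = x  # each chain starts at an observed data point
            for _step in range(m):
                xm = metropolis_step(xm, theta, upper, rng)
            total += xm
        mu_cd = total / len(data)  # CD replacement for mu(theta)
        theta += alpha * (data_mean - mu_cd)
    return theta


# Toy sample skewed towards large values, so the fitted theta should be positive.
sample = [5, 4, 5, 3, 4, 5, 4, 5, 2, 5] * 20
theta_hat = cd_fit(sample, upper=5)
print(f"theta_hat = {theta_hat:.3f}")
```

Because the chains are restarted at the data at every iteration and run for only `m` steps, no chain is ever run to equilibrium, which is the point of the method.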
4 Hierarchical prior and penalization


The main issue with this type of model, especially in large scale problems,
is the extremely high number of parameters. This leads to instability of the estimates
and slows the rate of convergence. Given the likelihood of the model, we can
add a prior over the parameters. A careful choice of the prior distribution helps
to avoid over-parametrization, in particular that due to the weights $\Omega$; the prior
also reduces the variance of the estimates, increasing the rate of convergence. Two
natural choices for the parameters are the Laplace and the Gaussian prior distributions,
which lead to $\ell_1$ and $\ell_2$ regularization, respectively. In the case of $\ell_2$
the gradient is explicit and we can easily derive the parameter updates as follows:


$$\Omega^{new} = \Omega + \alpha \left( \frac{1}{N} \sum_{i=1}^{N} \frac{\partial B(\lambda)}{\partial \lambda}\, T(x_i) - \frac{1}{N} \sum_{i=1}^{N} \frac{\partial B(\lambda)}{\partial \lambda}\, T(x_i^{(m)}) - \gamma\, \Omega \right) \tag{5}$$

with $\gamma$ the regularization parameter. Conversely, in the case of the Laplace prior
the non-differentiability of the absolute value leads to a non-explicit gradient.
Furthermore, the loss function is non-convex and proximal gradient methods are not
applicable. An interesting way to solve this problem is given by a recent work on
convolution-based smooth approximations [3]. The smooth approximation to the
non-differentiable absolute value function is computed via convolution with a Gaussian
function as below:


$$\phi_\sigma(t) = t\, \mathrm{erf}\left(\frac{t}{\sqrt{2}\,\sigma}\right) + \sqrt{\frac{2}{\pi}}\, \sigma \exp\left(-\frac{t^2}{2\sigma^2}\right) \tag{6}$$


where $\mathrm{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t \exp(-u^2)\, du$ is the standard error
function and $\sigma^2$ a hyperparameter. The advantage of using this approximation is that
$\phi_\sigma(t)$, unlike the absolute value function, is a smooth function and we can
compute its gradient. Furthermore, $\phi_\sigma(t)$ converges uniformly to the absolute
value function as $\sigma \to 0$, with a higher rate of convergence than other
approximations such as the square-root one. We can now add the
gradient of the penalization term to the parameter updates:
$$\Omega^{new} = \Omega + \alpha \left( \frac{1}{N} \sum_{i=1}^{N} \frac{\partial B(\lambda)}{\partial \lambda}\, T(x_i) - \frac{1}{N} \sum_{i=1}^{N} \frac{\partial B(\lambda)}{\partial \lambda}\, T(x_i^{(m)}) - \gamma\, \mathrm{erf}\left(\frac{\Omega}{\sqrt{2}\,\sigma}\right) \right) \tag{7}$$


           RSM     Gaussian  Laplace   |             RSM      Gaussian  Laplace
Variance   1.106   0.911     0.974     |  Variance   39.399   27.272    30.527


[Figure: two panels, "Gaussian Prior" (left) and "Laplace Prior" (right), each plotting Perplexity against Epoch for the standard RSM and for the penalized models.]


Fig. 1: Convergence plot for different values of the hyperparameter $\lambda$. The graph on
the left corresponds to the Gaussian prior, the graph on the right to the Laplace prior with
convolution-based approximation. In both cases the penalization leads to a faster convergence
rate for all the hyperparameter values, reducing the variance of the estimated weights.

Without losing asymptotic consistency, it is possible to use an annealed version
of the learning rate $\alpha$, which increases the rate of convergence to $\sqrt{n}$.
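
The convolution-based approximation (6) and its derivative can be checked numerically. The following minimal sketch uses only the Python standard library; the function names are hypothetical. Differentiating (6) with respect to $t$ gives $\mathrm{erf}(t / (\sqrt{2}\,\sigma))$, because the two exponential terms cancel, and this is the term that enters the penalized update.

```python
import math


def smooth_abs(t, sigma):
    """Gaussian-convolution approximation (6) of |t|."""
    return (t * math.erf(t / (math.sqrt(2.0) * sigma))
            + math.sqrt(2.0 / math.pi) * sigma * math.exp(-t * t / (2.0 * sigma ** 2)))


def smooth_abs_grad(t, sigma):
    """Derivative of (6) with respect to t: erf(t / (sqrt(2) * sigma))."""
    return math.erf(t / (math.sqrt(2.0) * sigma))


for sigma in (0.5, 0.1, 0.01):
    # Largest gap between the approximation and |t| on a grid over [-3, 3].
    gap = max(abs(smooth_abs(k / 100.0, sigma) - abs(k / 100.0)) for k in range(-300, 301))
    print(f"sigma = {sigma}: max gap on [-3, 3] = {gap:.5f}")
```

The gap is largest at $t = 0$, where it equals $\sigma \sqrt{2/\pi}$, so it shrinks linearly in $\sigma$.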


5 Application 


In this section we present experimental results of the proposed prior penalization
method applied to the Replicated Softmax (RSM) [1], a particular case of
the proposed exponential family model. The model can be viewed as the undirected
counterpart of Latent Dirichlet Allocation, one of the most widely used methods in topic
modeling. The basic idea is that documents can be represented as random mixtures
over latent topics, where each topic is characterized by a distribution over words. We
use the NIPS proceedings papers dataset, which contains about 400 documents with a
dictionary of more than 57,000 words. After several tests, we decided to use 10
variables in the hidden layer. To speed up learning we used stochastic mini-batches:
instead of computing the updates on all the observations, we permuted the data and
divided them into subsets of size m = 40. In large scale problems
mini-batches help both parallel computation and parameter convergence.
Learning was carried out using Contrastive Divergence with ten full Gibbs steps, with
the loss function defined as:




$$\text{Perplexity} = \exp\left( -\frac{1}{N} \sum_{n=1}^{N} \frac{1}{D_n} \log p(x_n) \right) \tag{8}$$


with $D_n$ the number of words in the n-th document and the log-likelihood $\log p(x_n)$
estimated as described in Section 3. We compare the standard model without
penalization (Replicated Softmax) with three values of the hyperparameter $\lambda$, for
both Gaussian and Laplace priors (Fig. 1). The rate of convergence is improved for both
types of prior and, in particular, higher values of the hyperparameter $\lambda$ lead to
lower values of the loss function.
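
As a small illustration of the loss in (8), the sketch below computes the perplexity from per-document log-likelihoods and document lengths. The function name and the toy inputs are hypothetical; in the experiments the log-likelihoods are estimated, not available in closed form.

```python
import math


def perplexity(log_likelihoods, doc_lengths):
    """Perplexity (8): exp(-(1/N) * sum_n log p(x_n) / D_n)."""
    if len(log_likelihoods) != len(doc_lengths):
        raise ValueError("one log-likelihood per document is required")
    n_docs = len(log_likelihoods)
    avg = sum(ll / d for ll, d in zip(log_likelihoods, doc_lengths)) / n_docs
    return math.exp(-avg)


# Sanity check: a model that assigns probability 1/V to every word has perplexity V.
vocab_size = 1000
lengths = [120, 300, 75]
uniform_ll = [d * math.log(1.0 / vocab_size) for d in lengths]
print(perplexity(uniform_ll, lengths))
```

The division by $D_n$ makes documents of different lengths comparable, so the value can be read as an effective vocabulary size per word.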


6 Conclusion 


We develop a general framework for probabilistic graphical models, focusing our
attention on a formal definition of the Contrastive Divergence method, which greatly
improves the efficiency of inference and opens the way to large scale problems. We
also introduce hierarchical prior distributions on the parameters to reduce the variance
of the estimates and to increase the rate of convergence. We concentrate our
discussion on the choice of the prior distribution, studying the problems related to the
optimization procedures for both differentiable and non-differentiable cases.
Experimental results and applications to large scale data confirm the increase in the
rate of convergence and the variance reduction for the different types of penalization
prior presented. Future work will concentrate on the model specification for an arbitrary
number of layers, in order to capture hierarchical dependencies.
Key-indicators for maternity hospitals and newborn readmission in Sicily


Un sistema di indicatori per la classificazione dei punti nascita e re-ricoveri
neonatali in Sicilia


Mauro Ferrante, Giovanna Fantaci, Anna Maria Parroco, Anna Maria Milito, 
Salvatore Scondotto 


Abstract This paper proposes a composite indicator for the classification of
maternity units, which takes into account the different dimensions of service
delivery as potential predictors of health outcomes. Infant readmission, a proxy of
morbidity, is considered as the outcome measure. The results highlight that,
after controlling for risk factors of the newborn and for the presence of a neonatal
intensive care unit, infants born in lower-level hospitals show higher readmission
rates than infants born in higher-level hospitals.

Riassunto Il presente articolo propone un indicatore composito per la
classificazione dei punti nascita che tenga conto delle diverse dimensioni coinvolte
nella qualità del servizio erogato, quali potenziali predittori di esiti di salute. Come
misura di esito vengono presi in esame i re-ricoveri neonatali, quale spia di
possibili complicanze. I risultati, controllando per la presenza di unità di terapia
intensiva neonatale ed altri fattori di rischio, mostrano tassi di riammissione più
elevati in strutture di basso livello, sulla base della classificazione proposta.

Key words: Composite indicator, birth-at-risk, healthcare evaluation, 
regionalization. 
1 Introduction


The relevance and quality of hospitals depend on various factors, among which
delivery volume, geographic location, and public or private ownership
represent only a small component [2]. In regionalization programs the most widely used
index for evaluating the quality of service delivery is the volume of activity
[4,7]. These programs generally aim at creating centers of excellence at which
patients choose to receive care. Examples of this kind may be found in cancer care or
complex surgical procedures [6]. Nonetheless, there are situations, for example acute
conditions such as acute myocardial infarction [3] or perinatal care
[10], in which delivery volume alone cannot provide an adequate picture of the
complexity of factors that need to be evaluated in orienting regional healthcare
programs.

Within the framework of maternity hospital regionalization programs, the National
Health and Medical Research Council recommends that pregnancies of less than 33
weeks' gestation be delivered at hospitals with a neonatal intensive care unit, in order
to reduce the risk of mortality and morbidity. However, although several studies have
demonstrated better outcomes for births-at-risk (e.g. pre-term births) in high-level
hospitals, with a neonatal intensive care unit and high delivery volumes [1,9],
the differences seem less marked for low- or no-risk childbirths
[5,11]. Moreover, regionalization programs in perinatal care in Europe and North
America have led to a decrease in the number of maternity hospitals, with direct
consequences also in terms of the travel times required to reach the hospital [8]. This
phenomenon has drawn greater attention to the geographic location of maternity hospitals
and, more generally, to issues related to maternity hospital accessibility.

Starting from these considerations, this paper proposes a composite indicator for
monitoring maternity hospitals in order to assist regionalization programs, and
classifies the maternity hospitals in Sicily on the basis of the proposed
indicator. In order to evaluate the proposed indicator, an analysis of the relationship
between newborn readmission within 30 days of the childbirth event and the hospital's
category is performed.


2 Materials and methods 


For the purposes of the present study several information sources were considered.
The main information source is the Birth Certificate Records (CeDAP),
a unique source of information on childbirth-related characteristics of both the
mother and the infant. As a second information
source, Hospital Discharge Records were used which, through integration
with CeDAP, allowed for the reconstruction of some characteristics of the childbirth
event, e.g. transfers, complications, and readmissions. Finally, a distance matrix
among Sicilian municipalities was considered in order to determine the travel time
between the mother's municipality of residence and the hospital's municipality. In
order to construct a composite indicator for the classification of maternity hospitals,
several dimensions were considered, and for each dimension a set of variables was
selected, as reported in Table 1.

For the delivery volumes dimension, both the total number of childbirths and the
caesarean section rate have been considered, given the relevance of both
aspects within the current national system for evaluating quality of care.
However, in order to take into account the degree of complexity of the
childbirth event, the caesarean section rate in Robson's categories 01 and 03
has also been included in the indicator, since a caesarean section
in these categories may be a signal of inappropriateness. The third dimension considered
relates to the territorial basin of the hospital, where the set of municipalities with at
least 5% of their childbirths in the selected hospital defines the basin of the considered
maternity hospital. For this dimension, the composite indicator includes both the
total number of childbirths of women residing in the hospital's basin municipalities
and the share of the basin's childbirths in the hospital over the total number of the
basin's childbirths. As for the travel time dimension, the average travel time over all
childbirths in the hospital has been included. Finally, for the transfer dimension, the
shares of childbirths with a transfer within one day of the mother and of the newborn
have been considered.


Table 1: Dimensions, variables, direction and weights of the composite indicator for maternity unit
classification.


Dimension   | Variable                                                                               | Direction | Weight
Volumes     | Total number of newborns (V1)                                                          | (+)       | 0.10
Volumes     | % of caesarean section births (V2)                                                     | (-)       | 0.10
Complexity  | % of caesarean section births in Robson = 01 or 03 (C1)                                | (-)       | 0.10
Basin       | Total newborns of women living in hospital's basin municipalities (B1)                 | (+)       | 0.10
Basin       | % of basin's births in the hospital (B2)                                               | (+)       | 0.15
Travel time | Average travel time between municipality of residence and hospital's municipality (T1) | (-)       | 0.15
Transfer    | % of transfers of the mother within one day from the birth (TR1)                       | (-)       |
Transfer    | % of transfers of the newborn within one day from the birth (TR2)                      |           |


Once this information had been derived, a standardization procedure was used for
variable transformation, and the elementary indices were aggregated through a
weighted average, with the weighting system indicated in Table 1. Weights were
derived after a sensitivity analysis with different weighting systems and through
technical meetings with experts. The resulting indicator allowed for a classification
of maternity hospitals into three categories. Finally, in order to evaluate the
association between the maternity unit's category and birth outcome, multiple logistic
regression analyses were performed in which readmission of newborns was
considered as the outcome measure, as a function of the proposed classification and
of other factors, such as the presence of a neonatal intensive care unit, while
controlling for other childbirth-related risk factors. For the definition of
childbirth-at-risk the following criteria were applied: mother's age below 20 or above 40
years; very low birth weight (<1500 g); gestational age lower than 37 weeks;
multiple childbirth; small for gestational age (SGA).
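
The aggregation step described above (standardization followed by a weighted average) can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the standardization, so z-scores are assumed; variables with a (-) direction are sign-flipped so that a higher score is always better; the data for the three units are hypothetical, and only two of the Table 1 variables are used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-scores; an assumed choice, the text only mentions a standardization procedure."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def composite_scores(variables, directions, weights):
    """Weighted average of standardized variables; (-) variables are sign-flipped."""
    total_weight = sum(weights.values())
    n_units = len(next(iter(variables.values())))
    scores = [0.0] * n_units
    for name, values in variables.items():
        sign = 1.0 if directions[name] == "+" else -1.0
        for i, z in enumerate(standardize(values)):
            scores[i] += weights[name] * sign * z / total_weight
    return scores


# Hypothetical data for three maternity units and two of the Table 1 variables.
variables = {"V1": [100, 500, 900], "V2": [40.0, 30.0, 20.0]}
directions = {"V1": "+", "V2": "-"}
weights = {"V1": 0.10, "V2": 0.10}
scores = composite_scores(variables, directions, weights)
print([round(s, 3) for s in scores])
```

Cut points on the resulting score would then split the units into the three categories; the cut points themselves are not reported in the text and are not guessed here.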


3 Results 


In 2014 a total of 44,436 newborns were delivered in the 56 maternity hospitals
of the Sicily region. After the application of the exclusion criteria and the
record linkage between Birth Certificate Records and Hospital Discharge Records,
the valid cases numbered 34,861. The average volume of activity was
794, with a high degree of variability, ranging from a minimum of 13 newborns in
the case of the Lipari Island hospital to a maximum of 2538. Three hospitals were not
included in the analysis, having fewer than 10 newborns in the year considered. A
rather high degree of variability also appeared for all the other dimensions
considered in the composite indicator.

In Figure 1 the relationship among delivery volumes, the score of the proposed
composite indicator and the current regional classification is analyzed.
Dashed lines indicate the cut points chosen for the categorization of the proposed
indicator into three categories. It can be seen that, although there is a direct
relationship between delivery volumes and the proposed indicator, the proposed
classification and the current regional classification differ in how they classify
maternity units.


[Figure: composite indicator score against delivery volumes (axis from 0 to 2500); legend: Not Classified, Level I, Level II.]


Figure 1: Delivery volumes, regional classification of maternity units, and composite indicator score, in
Sicily, year 2014.
In Table 2 the results of the logistic regression model are reported, in which the
outcome variable takes value 1 if the newborn was readmitted within 30 days
after childbirth, and 0 otherwise. In order to control for some risk factors of the
newborn and for the degree of severity, the presence of a neonatal intensive care unit
and other childbirth-related risk factors have been considered.


Table 2: Results of the logistic regression model for infant readmission within 30 days after childbirth,
by composite indicator category, birth-at-risk and presence of a neonatal intensive unit in the hospital


Variable            | Category                                 | OR       | Lower 95% CI | Upper 95% CI
Composite indicator | High (Reference)                         |          |              |
Composite indicator | Medium                                   | 0.903    | 0.764        | 1.068
Composite indicator | Low                                      | 1.320**  | 1.065        | 1.636
Risk                | Newborn-at-risk = No (Reference)         |          |              |
Risk                | Newborn-at-risk = Yes                    | 1.483**  | 1.267        | 1.736
UTIN                | Neonatal intensive unit = No (Reference) |          |              |
UTIN                | Neonatal intensive unit = Yes            | 4.431*** | 3.677        | 5.339
Constant            |                                          | 0.011**  | 0.009        | 0.013


*, ** and *** indicate significance at the 0.1, 0.05 and 0.01 levels, respectively.


The results in Table 2 show a strong direct association of
both the presence of risk factors (OR=1.48) and the presence of a neonatal
intensive care unit (OR=4.43) with infant readmission within 30 days after
birth. Nonetheless, after adjusting for these factors, births in low-category maternity
units show a higher risk of readmission than births in high- and
medium-level hospitals (OR=1.32 relative to the high category). Considering that more
complicated situations should be assisted by higher-level hospitals, this result calls for
monitoring actions aimed at exploring the reasons for these readmissions in low-level
maternity units. Moreover, the low-level category of the proposed classification comprises
not only low-volume maternity units, but also level I and one level II maternity units,
according to the current regional classification criterion.
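
The odds ratios in Table 2 can be read back on the log-odds scale. The sketch below assumes that the reported limits are Wald intervals, i.e. exp(beta ± 1.96 · se); this is a common convention but it is not stated in the text. Under that assumption it recovers the coefficient, its standard error, the z statistic and a two-sided p-value. The function name is hypothetical.

```python
import math


def wald_from_or(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the log-odds coefficient, its standard error, z and two-sided p-value."""
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return beta, se, z, p_value


# Odds ratios and 95% limits as reported in Table 2.
table2 = {
    "Composite indicator: Medium": (0.903, 0.764, 1.068),
    "Composite indicator: Low": (1.320, 1.065, 1.636),
    "Newborn-at-risk: Yes": (1.483, 1.267, 1.736),
    "Neonatal intensive unit: Yes": (4.431, 3.677, 5.339),
}
for label, (or_value, low, high) in table2.items():
    beta, se, z, p = wald_from_or(or_value, low, high)
    print(f"{label:30s} beta = {beta:+.3f}  se = {se:.3f}  z = {z:+.2f}  p = {p:.4f}")
```

An interval that excludes 1 on the odds-ratio scale corresponds to a p-value below 0.05 on this scale, which is the case for the low category but not for the medium one.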


4 Discussion and conclusion 


The classification of maternity units is an important issue, not only from
the scientific perspective, related to the analysis of the determinants of health
outcomes, but especially from the health policy perspective [2]. It has been shown
that several elements, such as geographic location, transfers, post-partum length of
stay, and childbirth risk factors, should be considered in evaluating the quality of
service delivery [1,5,8,9]. The proposed approach tries to overcome some of the
limits of the current classification system of maternity units, which is based mainly
on delivery volumes. The proposed indicator takes into account a number of
these dimensions and may constitute the basis for classifying maternity units into
different levels, considering both volume-related aspects of service delivery and
quality and territorial aspects related to the childbirth event.

The results showed that the proposed classification may predict morbidity
conditions, as measured by 30-day infant readmissions after birth. Further
research is required to better highlight the determinants of these readmissions, jointly
with the analysis of the relationship between the proposed classification and other
potential proxies of quality of care (e.g. mother readmissions, mother and infant
mortality).
Change of Variables theorem to fit Bimodal Distributions


Il teorema del Cambio di Variabile per modellare 
distribuzioni Bimodali 


Ferretti Camilla, Ganugi Piero, and Zammori Francesco 


Abstract Bimodality is observed in empirical distributions of variables related
to materials (glass resistance), companies (productivity) and natural phenomena
(geyser eruptions). Our proposal for modeling bimodality exploits the change of
variables theorem, which requires the choice of a generating density function
representing the main features of the phenomenon under analysis, and the choice of a
transforming function $\varphi(x)$ describing the observed departure from the expected
behaviour. The novelty of this work consists in paying attention to the choice of
$\varphi(x)$ in two different cases: when bimodality arises from a slight departure from
unimodality and when it is a proper structural feature of the variable under study.
As an example we use the R "geyser" dataset.


Abstract La bimodalità è osservata in distribuzioni empiriche legate alle proprietà
dei materiali (resistenza del vetro), alla produttività delle imprese e a fenomeni
naturali (eruzione di geyser). La nostra proposta per modellare distribuzioni bimodali
si basa sul teorema del cambio di variabile, il quale richiede la selezione di una
distribuzione generatrice che formalizza le caratteristiche strutturali del fenomeno
in esame, e di una funzione trasformatrice $\varphi(x)$ che descrive la distorsione
osservata nei dati rispetto al comportamento atteso. La novità di questo lavoro consiste
nello scegliere con particolare attenzione la funzione $\varphi(x)$ in due casi differenti:
quando la bimodalità è causata da una lieve distorsione di un fenomeno altrimenti
unimodale, e quando essa è invece una caratteristica strutturale della variabile in
esame. Come esempio utilizzeremo il dataset "geyser" di R.


Key words: bimodal density function, change of variables theorem. 
1 The problem


Bimodal distributions are observed in datasets arising from different research fields,
for instance: 1) glass resistance (to a given amount of pressure); 2) firm
productivity measured in number of units produced per unit of time (e.g. the Tuscan
CAAF as in [1]); 3) the default probability for rated firms ([9]); 4) the waiting time
between consecutive eruptions of the Old Faithful geyser, taken from the "geyser"
dataset contained in the R package MASS ([10]).

Bimodality has been treated chiefly using the mixture approach. Mixtures are
a suitable model when two or more distinct groups with specific distributions
are considered as a single set. Conversely, when the original groups composing the
mixture are not recognizable, or the specific density function of each group is not known,
or when the observed phenomenon is structurally bimodal, mixtures may not
provide a good fit to the data. Consequently, it is worth using an alternative approach
to obtain a bimodal distribution for fitting the original data.


1.1 Change of variables theorem and choice of generating and 
transforming functions 


The basic tool for modeling bimodality is the well-known change
of variables theorem ([3, 5], among others):


Theorem 1. Given a continuous r.v. Y with known density function f(y), and given
a transformed r.v. X = h(Y) with h(·) monotone, the density function of X is given by
the following formula:


$$g(x) = f\left(h^{-1}(x)\right) \left| \frac{d\, h^{-1}(x)}{dx} \right| \tag{1}$$


Let X be the variable under study, having unknown density function. The theorem
makes it possible to obtain a formula for the density function g(·) through the
following steps:


1. Choice of a suitable r.v. Y whose density function f(y) is known and represents
the main structural features of the phenomenon under analysis. The function f is
called the generating density function.

2. Choice of a transformation function $Y = \varphi(X)$ describing the relationship
between the observed data and the expected theoretical behavior given by f(y).


The theorem is then applied to $X = h(Y) = \varphi^{-1}(Y)$, which gives
$g(x) = f(\varphi(x))\, |\varphi'(x)|$.

This procedure has been used for fitting unimodal data ([7]),
assuming that the generating variable is the Standard Normal Z with
density function $\phi(z)$, and three possible data transformations are proposed, as in
the following table:

Table 1 Three choices for $\varphi(x)$ proposed in [7] and the resulting density function g(x).


$\varphi(x)$                                        | Density function of X
$z = \ln(x)$, $x \in (0, +\infty)$                  | $g(x) = \frac{1}{x\sqrt{2\pi}} \exp\left(-\frac{1}{2} [\ln x]^2\right)$
$z = \ln\left(\frac{x}{1-x}\right)$, $x \in (0, 1)$ | $g(x) = \frac{1}{x(1-x)\sqrt{2\pi}} \exp\left(-\frac{1}{2} \left[\ln\frac{x}{1-x}\right]^2\right)$
$z = \ln(x + \sqrt{x^2+1})$, $x \in \mathbb{R}$     | $g(x) = \frac{1}{\sqrt{2\pi (x^2+1)}} \exp\left(-\frac{1}{2} \left[\ln(x + \sqrt{x^2+1})\right]^2\right)$

Summarizing, in [7] it is assumed that the behavior of the observed data is due to a
departure from Normality which is formally described by the transformation $\varphi(x)$.
Moreover, the three transformation functions in Tab. 1 make it possible to model almost
all the unimodal density functions encountered in empirical applications.
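
The construction can be checked numerically. The sketch below applies the change of variables formula with a Standard Normal generating density to the first and third transformations of Tab. 1, and verifies by a simple midpoint rule that the resulting functions integrate to approximately one. It uses only the Python standard library; all function names are hypothetical.

```python
import math


def std_normal_pdf(z):
    """Standard Normal density, used as the generating density f."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def transformed_density(x, phi, dphi, f=std_normal_pdf):
    """Density of X when phi(X) has density f: g(x) = f(phi(x)) * |phi'(x)|."""
    return f(phi(x)) * abs(dphi(x))


def g_log(x):
    """First row of Tab. 1: phi(x) = ln(x), the log-normal density."""
    return transformed_density(x, math.log, lambda t: 1.0 / t)


def g_asinh(x):
    """Third row of Tab. 1: phi(x) = ln(x + sqrt(x^2 + 1)) = asinh(x)."""
    return transformed_density(x, math.asinh, lambda t: 1.0 / math.sqrt(t * t + 1.0))


def integrate(fn, a, b, n=200_000):
    """Midpoint rule, used only to check that a density integrates to about 1."""
    h = (b - a) / n
    return h * sum(fn(a + (i + 0.5) * h) for i in range(n))


print(integrate(g_log, 0.0, 200.0))
print(integrate(g_asinh, -500.0, 500.0))
```

The same `transformed_density` helper accepts any monotone, differentiable transformation, so a candidate bimodal fit can be checked in the same way before it is used.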


2 Analyzing bimodal distributions 


The novelty of this work consists in applying the above procedure to
data displaying a bimodal behavior. In this respect, we stress the following
facts:


1. The choice of the generating function is related to the inner structure of the
observed variable. If we choose a unimodal generating function (e.g. the Normal
density) we are actually assuming that bimodality is slight and due only to some
kind of moderate perturbation affecting the underlying unimodal model.

2. On the other hand, if the variable is known to be structurally bimodal, the suitable
choice is a bimodal generating function. In this case, the main problem is the
limited menu of existing bimodal density functions.

3. Given the generating function, it is necessary to find a suitable transformation
function $\varphi(\cdot)$. The choice of $\varphi$ is fundamental to map the observed data
back to the expected theoretical model. In the literature, we observe mainly two
approaches for the choice of $\varphi(x)$:


a. Regression analysis applied to the Q-Q plot of the observed percentiles
against the theoretical percentiles of the generating function, as in [11].

b. The derivation of a differential equation whose solutions form a family of
suitable transformation functions, as in [4, 8].


3 Bimodality deriving from a Normal generating function 


As a first step, we assume that the empirical variable is structurally unimodal, and
we choose the Standard Normal distribution as generating function. Fig. 1 displays
a simulated example illustrating the form we expect for the transformation function.
Indeed, the bimodality can be interpreted as a polarization of statistical units,
originally located close to the median of the distribution, that shift in opposite
directions. In this light, a suitable transformation function should have a reversed-S
shape, as illustrated in Fig. 1 (bottom left panel).
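
The role of the transformation can be made explicit with the standard change-of-variable
formula. As a sketch, assume that g is differentiable and strictly monotone on the support
of the data and that the transformed variable Z = g(X) follows the Standard Normal
distribution; the symbols X, Z, f_X and \varphi are notation introduced here for
illustration only. The density of the observed variable is then

```latex
\[
  f_X(x) = \varphi\bigl(g(x)\bigr)\,\lvert g'(x) \rvert ,
  \qquad
  \varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}.
\]
```

In this expression a small slope |g'(x)| near the median lowers the density there, which
is how a unimodal generating function can give rise to two modes in the observed data.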


Fig. 1 Graphical representation of the transformation of bimodal data (upper panel)
into a unimodal (Normal) distribution (lower right panel) through a reversed-S-shaped
transformation function g(x) (lower left panel).


4 Empirical illustration 


As an empirical example, we use the dataset "geyser" contained in the R package
MASS. The data are the observed durations (in minutes) of the eruptions of the Old
Faithful geyser and the waiting times between two consecutive eruptions. Both
variables show bimodality, as shown by the histograms based on 25 classes in
Figs. 3 and 4. The Q-Q plot comparing the observed percentiles of each variable with
the Normal percentiles confirms the reversed-S-shaped transformation described in the
previous section, for both "duration" and "waiting" (see Fig. 2).

As a first attempt, we choose $g(x) = Ax^3 + Bx^2 + Cx + D$ for both variables,
given the flexibility of third-order polynomials, and we estimate the parameters using
OLS. The resulting fit of the Q-Q plot still needs to be improved. In particular, we
note that prominent bimodality, as in the "duration" variable, is more challenging due
to the lack of observations near the 50th percentile. To this end, we will follow
two approaches:


1. We replace polynomial regression with third- or fourth-order B-spline regression
([2]), which should improve the fit, make it possible to impose monotonicity
constraints on g(·), as required by Theorem 1, and avoid the collinearity
drawbacks that affect polynomial regression.

2. From a theoretical point of view, we aim to supply a differential equation for
constructing a family of reversed-S-shaped transformation functions. The advantage
of this choice is that it eases the economic or physical interpretation of the
parameters with respect to the spline regression approach and, with it, the
explanation of the hidden causes of bimodality.
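
The first attempt described above (a cubic polynomial fitted by OLS on the Q-Q plot) can be
sketched as follows. This is only an illustration under stated assumptions: the sample is
simulated rather than taken from the geyser data, the plotting positions (i - 0.5)/n are
one common convention, g is taken to map observed values to Standard Normal quantiles, and
the function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def qq_points(sample):
    """Return sorted observations and the matching Standard Normal quantiles."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    probs = (np.arange(1, n + 1) - 0.5) / n
    z = np.array([NormalDist().inv_cdf(p) for p in probs])
    return x, z


def fit_cubic_transformation(sample):
    """Least-squares fit of g(x) = A x^3 + B x^2 + C x + D on the Q-Q plot."""
    x, z = qq_points(sample)
    coeffs = np.polyfit(x, z, deg=3)  # ordered as [A, B, C, D]
    fitted = np.polyval(coeffs, x)
    r2 = 1.0 - np.sum((z - fitted) ** 2) / np.sum((z - z.mean()) ** 2)
    # A valid transformation must be increasing: check g'(x) > 0 at the data.
    slopes = np.polyval(np.polyder(coeffs), x)
    return coeffs, float(r2), bool(np.all(slopes > 0))


# Simulated bimodal sample (hypothetical values, not the geyser data).
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(2.0, 0.3, 150), rng.normal(4.3, 0.4, 250)])
coeffs, r2, monotone = fit_cubic_transformation(sample)
print("A, B, C, D:", coeffs)
print("R^2:", r2, "monotone on the data:", monotone)
```

The monotonicity flag shows why an unconstrained polynomial may be inadequate: nothing in
the least-squares fit forces the derivative to stay positive.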


5 Conclusions and further research 


In this work we focus on the problem of modeling bimodal data. We address the
problem using the change of variable theorem, which requires a suitable transformation
function such that the transformed data have a known shape. As an example, we
apply this procedure to the geyser data, choosing a Normal generating function and a
polynomial transformation. The fit we obtain is not yet satisfactory; we therefore aim
to improve the procedure using B-spline regression. Alternatively, we will try
to produce a family of transformation functions, all of them mirroring an apparent
statistical regularity (the reversed-S-shaped Q-Q plot associated with bimodality). The
advantage of the second approach is that it eases the economic or physical
interpretation of the parameters.

As a further research step, we aim to choose a bimodal generating function. Since
research on theoretical bimodal distributions has been scarce so far (see, for example,
some recent work such as [6]), this choice will require the construction, from scratch,
of a suitable bimodal density function to be used as structural model.
Space-time clustering for identifying population
patterns from smartphone data




Francesco Finazzi and Lucia Paci 


Abstract In this work we aim at studying spatio-temporal patterns of population
movement across a large city. We exploit the information on people's positions
collected by the smartphone application of the Earthquake Network project and we
adopt a dynamic model-based clustering approach to identify the patterns. The
approach is applied to smartphone data collected in Santiago (Chile) over the period
February-April 2016. Some preliminary results are presented and discussed.


Key words: Finite mixture models, Markov chain Monte Carlo, spatio-temporal 
modeling, state-space, crowd-sourcing data 


1 Introduction 


Detecting population dynamics over short periods (e.g. daily movements) may provide
the public with useful information to improve traffic infrastructure associated
with spatio-temporal commuting patterns, upgrade the accessibility or attractiveness of
areas visited by fewer people than others, and enhance public transportation according
to infrastructure/open space utilization. Indeed, population patterns are characterized
by drastic changes during the day according to several activities such as education,
work, recreation, visiting and shopping, among others.

Customarily, population studies are based on census data, which do not capture
population movements over short periods. By contrast, mobile-based data collected
over a given region at a high temporal resolution offer new opportunities to study
population distribution and movement patterns over such a region. For instance, Secchi
et al. (2015) proposed a non-parametric method for the analysis of spatially dependent
functional mobile network data to identify subregions of the metropolitan area of
Milan sharing a similar pattern over time, possibly related to activities taking
place in specific locations and/or at specific times within the city.

Alternatively, we can identify potential partitions of the space and study their
evolution over time to extract useful and concise information from smartphone-based
data that helps to investigate population dynamics. Recently, Paci and
Finazzi (2017) proposed a model-based approach to identify clusters in data
collected at fixed spatial locations and time steps. Within finite mixture modeling,
spatio-temporally varying mixing weights are introduced so that observations
at nearby locations and consecutive time points receive similar cluster membership
probabilities. As a result, a clustering that varies over time and space is obtained.
Conditionally on the cluster membership, a state-space model is deployed to
describe the temporal evolution of the sites belonging to each group.

In this work we employ the dynamic space-time clustering approach to explore
population dynamics and motion patterns over the city of Santiago (Chile) using
data coming from the Earthquake Network project (www.earthquakenetwork.it).
The project implements a crowdsourced earthquake early warning system
based on smartphone networks (Finazzi and Fassò, 2016), and it requires collecting
the precise location in space of the smartphones at regular time steps. Here, it is
assumed that the smartphone location is also the position in space of its owner.


2 Bayesian space-time mixture modeling 


Let $y_t(s)$ be a response variable observed at time $t$ ($t = 1,\dots,T$) and location
$s \in \mathbb{R}^2$. We assume that observation $y_t(s)$ comes from a finite mixture
model, that is

$$f(y_t(s) \mid \pi, \Theta) = \sum_{k=1}^{K} \pi_{t,k}(s)\, f(y_t(s) \mid \Theta_k) \qquad (1)$$

where $K$ is the number of components. The distribution under the $k$-th component
($k = 1,\dots,K$) is denoted by $f(\cdot \mid \Theta_k)$, where $f$ is a density function
of specified form and $\Theta_k$ denotes the set of parameters of each component
distribution. The mixing probability $\pi_{t,k}(s)$ is the probability that the location
$s$ belongs to component $k$ at time $t$, and it satisfies $\pi_{t,k}(s) > 0$ with
$\sum_{k=1}^{K} \pi_{t,k}(s) = 1$ for each $s$ and $t$.
As usual in Bayesian analysis, a hierarchical formulation of the mixture model is
exploited to facilitate the computation. For each observation, we introduce a latent
allocation variable, $w_t(s)$, that identifies the component membership of $y_t(s)$, that
is $\Pr(w_t(s) = k) = \pi_{t,k}(s)$. In other words, we assume that the allocation
variables $w_t(s)$ are conditionally independently distributed given $\pi_{t,k}(s)$ and
that they come from a multinomial distribution. Given the latent $w_t(s)$, the
observations $y_t(s)$ are independent with
$f(y_t(s) \mid w_t(s) = k, \Theta) = f(y_t(s) \mid \Theta_k)$. As customary in model-based
clustering, we interpret each mixture component as a cluster, such that observations
are partitioned into $K$ mutually exclusive groups.

The mixing probabilities, $\pi_{t,k}(s)$, are allowed to vary from observation to
observation, i.e., across space and over time. Space-time dependence in the observations
is introduced through the prior distribution of the weights, such that observations
corresponding to nearby locations and consecutive time points are more likely to
have similar allocation probabilities than observations that are far apart in space and
time. For each location $s$ and time $t$, the weights take the form

$$\pi_{t,k}(s) = \frac{\exp\left(x_t(s)'\beta_k + \phi_{t,k}(s)\right)}{\sum_{l=1}^{K} \exp\left(x_t(s)'\beta_l + \phi_{t,l}(s)\right)} \qquad (2)$$

where $x_t(s)$ is a $p \times 1$ vector of covariates, $\phi_{t,k}(s)$ are spatio-temporal
random effects, and $\beta_1 = 0$ and $\phi_{t,1}(s) = 0$ ($t = 1,\dots,T$) to ensure
identifiability. The logistic-type transformation in (2) guarantees that the two
conditions on the mixing probabilities stated above (positivity and summing to one)
are satisfied (Fernandez and Green, 2002). When available, covariates may help
in predicting group membership probabilities, while random effects provide an
adjustment in space and time to the explanation provided by the covariates. Therefore,
the response distribution is allowed to vary in flexible ways across time, space and
covariate profiles.

To allow for dynamics over time and dependence over space we assume, for
$k = 2,\dots,K$,

$$\phi_{t,k}(s) = \rho_k\, \phi_{t-1,k}(s) + \delta_{t,k}(s) \qquad (3)$$


where $\delta_{t,k}(s)$ are independent-in-time, spatially correlated errors coming from a
zero-mean Gaussian Process (GP) equipped with an exponential spatial covariance
function. Although the $K - 1$ spatio-temporal random effects $\phi_{t,k}(s)$ are assumed
to be independent, the corresponding weights are not independent, given their definition
in (2). The space-time structure of the random effects $\phi_{t,k}(s)$ makes it possible to
borrow strength from nearby sites and consecutive time steps. As a result, similar
outcomes at nearby space and time points are assigned similar cluster membership
probabilities.
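
As a small numerical illustration of the logistic-type weights in (2), the sketch below
computes the K mixing probabilities at one site and time from the linear predictors. The
covariate values, coefficients and random effects are hypothetical, and the helper names
are introduced here only for illustration.

```python
import math


def linear_predictor(x, beta, phi):
    """Return x' beta + phi for one component at one site and time."""
    return sum(xi * bi for xi, bi in zip(x, beta)) + phi


def mixing_weights(predictors):
    """Turn K linear predictors into K mixing probabilities.

    The first predictor must be 0, mirroring the identifiability
    constraints beta_1 = 0 and phi_{t,1}(s) = 0.
    """
    if predictors[0] != 0.0:
        raise ValueError("the reference component must have predictor 0")
    shift = max(predictors)  # subtracting the maximum avoids overflow
    exps = [math.exp(v - shift) for v in predictors]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values for K = 3 components and p = 2 covariates.
x = [1.0, 0.5]
betas = [[0.0, 0.0], [0.4, -1.0], [-0.2, 0.3]]
phis = [0.0, 0.8, -0.5]
predictors = [linear_predictor(x, b, f) for b, f in zip(betas, phis)]
weights = mixing_weights(predictors)
print(weights, sum(weights))
```

By construction the weights are strictly positive and sum to one, whatever the values of
the covariates and random effects.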

Model (1) requires the specification of the sampling density $f(y_t(s) \mid \Theta_k)$. The
approach pursued in this work is based on dynamic linear modeling, often referred to
as state-space modeling. In particular, we assume a dynamic linear model to describe
the temporal evolution of all the sites within component $k$.

Let $y_t = (y_t(s_1),\dots,y_t(s_n))'$ be the $n \times 1$ observation vector at time $t$,
where $n$ is the number of locations. Conditionally on the allocation variables, the
state-space model is given by

$$y_t = H_t z_t + \varepsilon_t \qquad (4)$$
$$z_t = G z_{t-1} + \eta_t \qquad (5)$$

where $z_t = (z_{t,1},\dots,z_{t,K})'$ is the $K \times 1$ state vector, $H_t$ is an
$n \times K$ matrix defined below, and $G$ is a $K \times K$ stable transition matrix.
Finally, $\varepsilon_t \sim N(0, \sigma^2 I_n)$ is the $n \times 1$ measurement error
vector and $\eta_t \sim N(0, \Sigma_\eta)$ is the $K \times 1$ innovation vector.

We now turn to matrix $H_t$. Suppose that site $s_i$ belongs to component $k$ at time
$t$. Then, the $i$-th row of matrix $H_t$ contains a single element equal to one at
position $k$, while all the other elements are zero (Inoue et al., 2007; Finazzi
et al., 2015). Note that the one-zero structure of matrix $H_t$ is allowed to vary over
time according to the mixing probabilities $\pi_{t,k}(s)$. Also, we benefit from borrowing
strength across all sites belonging to component $k$ at time $t$, since they all
contribute to estimating the common latent state $z_{t,k}$. Given the specification in (5),
the desired temporal pattern of cluster $k$ is represented by the latent state $z_{t,k}$.
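
The one-zero structure of $H_t$ can be illustrated with a short sketch. It builds $H_t$
from a vector of allocations and evaluates the noise-free parts of (4) and (5); the
allocations, the transition matrix and the previous state are hypothetical values, cluster
labels are 0-based for convenience, and the function names are not taken from the DYSC code.

```python
def build_H(allocations, K):
    """n x K one-zero matrix: row i has a single 1 in the column of its cluster."""
    return [[1 if k == w else 0 for k in range(K)] for w in allocations]


def mat_vec(M, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical setting: n = 4 sites, K = 2 clusters (labels 0 and 1).
allocations = [0, 1, 1, 0]
H_t = build_H(allocations, K=2)

G = [[0.9, 0.0], [0.0, 0.8]]      # assumed stable transition matrix
z_prev = [1.0, -1.0]              # assumed latent states at time t - 1

z_mean = mat_vec(G, z_prev)       # mean of z_t in (5), innovation omitted
y_mean = mat_vec(H_t, z_mean)     # mean of y_t in (4), error omitted
print(H_t)
print(z_mean, y_mean)
```

Each site inherits the latent state of the cluster it is allocated to, so all sites in
the same cluster share the same expected value at time t.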

Full inference is carried out under a Bayesian framework. The hierarchy of the
model is completed by independent noninformative prior distributions for all the
hyperparameters, and Markov chain Monte Carlo (MCMC) algorithms are employed
to approximate the joint posterior distribution; see Paci and Finazzi (2017) for all
fitting details and posterior computation. Model fitting is carried out using the
MATLAB code DYSC, available online at https://github.com/graspa-group/DYSC.


3 Analysis of smartphone data 


Smartphones taking part in the Earthquake Network project send a heartbeat signal
to a central server about every 30 minutes. Signals include the geographic location
of the smartphone, which is used to estimate the state of the network at any given time.

In this work, we exploit the information carried by the heartbeat signals to study
the population movement across the city of Santiago. We consider 24,900 smartphones
and we assume they are representative of the entire Santiago population. We
partition the city of Santiago into a uniform lattice of N = 354 sites and for each site
we consider the number of signals on an hourly basis. For each hour of the day, we
aggregate the signals observed over the period February-April 2016, assuming that the
daily motion patterns of the population are stable over the 3 months. Moreover, we
distinguish between working days and weekends in order to investigate possible
differences. The aggregation leads to two N × T matrices, for the working days and the
weekend respectively, with T = 24. Since we aim at studying the motion patterns
independently of the number of signals, we standardize each time series with
respect to its own mean and variance. This makes the sites directly comparable.
Hence, at each time step, the time series is interpreted in the following way: a
negative value means that the number of signals coming from the site is below the site
average, while a positive value means that the number of signals is above average.
Figure 1 shows the standardized number of signals received from each site during
working days (left panel) and weekend (right panel) over the study period.
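
The standardization step can be sketched as follows. The hourly counts are hypothetical,
only four time steps are used instead of T = 24, and the use of the population standard
deviation is an assumption, since the text does not state which variance estimator is used.

```python
from statistics import fmean, pstdev


def standardize(series):
    """Centre a site's series on its own mean and scale by its own std. dev."""
    mu = fmean(series)
    sd = pstdev(series)  # population standard deviation (assumed choice)
    if sd == 0:
        raise ValueError("constant series cannot be standardized")
    return [(v - mu) / sd for v in series]


# Hypothetical hourly signal counts for two sites with very different volumes.
site_a = [10, 20, 30, 40]
site_b = [100, 200, 300, 400]

z_a = standardize(site_a)
z_b = standardize(site_b)
print(z_a)
print(z_b)
```

The two sites differ by a factor of ten in volume but share the same standardized
profile, which is what makes them directly comparable in the clustering step.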

Thus, at each time step, we apply the model described in Section 2 to cluster sites
that behave in a similar way with respect to their average, and then we explore
how the clusters evolve over the 24 hours of the day. We employ the diagnostic tool
provided by Paci and Finazzi (2017) to select the number of clusters. The analysis
suggests that only two clusters are needed. This is a consequence of the fact that the
time series are standardized, so that the number of signals from each site can be either
below or above average. Figure 2 shows the posterior 95% credible intervals of the
temporal patterns $z_{t,k}$ for working day signals (left panel) and weekend signals (right
panel), where each temporal pattern is related to a cluster. During working days, the
separation between the temporal patterns is lower at 7 a.m. and 7 p.m., namely when
people commute from home to work and vice versa. During these hours, signals are
more evenly distributed across the city than at any other hour of the day. During the
weekend, the same effect can be found at 10 a.m. and at midnight.

To obtain the clustering, we assign each observation to its most likely group
according to the maximum a posteriori probability (MAP) rule. Figure 3 shows the
clustering result at 12 a.m., 8 a.m. and 8 p.m. for both working
days and the weekend. For any given hour of the day, blue and red points are sites
with a number of signals below and above the average, respectively. During working
days, the number of signals from the city center is below average at night and above
average during the day. This pattern is disrupted during the weekend, when people
tend to move later in the morning and to return home later at night.
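
A minimal sketch of the MAP rule is given below. It assumes that MCMC draws of the
allocation variables are available for each site and that posterior membership
probabilities are estimated by relative frequencies; both the draws and the function names
are hypothetical and are not taken from the DYSC code.

```python
from collections import Counter


def posterior_membership(draws, K):
    """Relative frequency of each cluster label among the MCMC draws of one site."""
    counts = Counter(draws)
    return [counts[k] / len(draws) for k in range(K)]


def map_allocation(probabilities):
    """Index of the most probable cluster (maximum a posteriori rule)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical draws of the allocation variable for three sites, K = 2.
draws_by_site = [
    [0, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0],
]
probs = [posterior_membership(d, K=2) for d in draws_by_site]
clusters = [map_allocation(p) for p in probs]
print(probs)
print(clusters)
```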


Fig. 1 Number of signals collected from each cell during working days (left panel) and weekend 
(right panel) over the period February-April, 2016. 
Fig. 2 Posterior 95% credible interval of the temporal patterns $z_{t,k}$ for working day
signals (left panel) and weekend signals (right panel).
Fig. 3 Clustering result for working days (top row) and the weekend (bottom row) at 12 a.m.
(left column), 8 a.m. (middle column) and 8 p.m. (right column). Blue and red dots refer to
the blue and red temporal patterns in Figure 2, i.e., below and above the average,
respectively.
IT Solutions for Analyzing Large-Scale Statistical
Datasets: Scanner Data for CPI




Annunziata Fiore, Antonella Simone and Antonino Virgillito 


Abstract In this paper we present the issues and challenges related to dealing with
large datasets such as those involved in the Scanner Data project at Istat. We
describe the IT solutions introduced as part of a broader approach to the
modernisation of the tools and techniques used for data storage and processing in Istat,
in view of the future challenges posed by Big Data and Data Science in NSIs. We
show how the IT architecture can support the methodological choices for the construction
of consumer price microindices by comparing different approaches to computing
indices through an extensive analysis carried out over the entire dataset.


1 Introduction


The Istat Scanner Data project started in 2015 [1] and is currently going through a
pre-production phase that will continue throughout 2017. The objective of the
project is to carry out a massive revision of the production process of Consumer Price
Indices in order to replace the on-field data collection for grocery products in
supermarkets with the prices obtained from the scanner data source [3]. Once in
production (scheduled for 2018), scanner data will be the first example at Istat of such
a large dataset being used as a source for a production process.

A number of challenges are involved in handling the constant flow of data and storing
it safely. A solid data processing pipeline therefore had to be put in place by the IT
sector, allowing the different sectors of the institute involved in the process (data
collection and production) to control its correct evolution and the quality of the data.

However, in a modern data-driven organization IT tools are not only meant to back
production processes but should be considered one of the main drivers behind the
organization's core business. In the era of Big Data, the capability of analysing large
amounts of data requires IT solutions that are effective, governed and secure,
and that are available not only to IT specialists but also to researchers.

Under these premises, the scanner data project was not only relevant for its main 
objective, the redesign of one of the most important production processes of Istat, but 
it also was the first important testbed of a general approach to the modernization of 
the tools used for statistical production. 

In this paper we present the hybrid data architecture that has been built to support
the scanner data project, integrating a traditional RDBMS with a Big Data Processing
Platform that has recently been set up at Istat and has been used for the first time
specifically for this project. An in-depth discussion is provided of the benefits and
the trade-offs resulting from the use of Big Data technology for statistical production.
We exploited the advanced capabilities of the platform to carry out an extensive analysis
of the whole dataset that would not have been possible with traditional tools. In
particular, we simulated the implementation of two different methods of aggregating
elementary indices (“static”, based on a yearly updated fixed basket, and “dynamic”,
based on chaining prices over consecutive months) and provided the production sector
with in-depth insights into the performance and the practical feasibility of the two
methods.


2 Processing Scanner Data at Istat 


In this section we give a general overview of the scanner data project and discuss the 
challenges that were related to the treatment of a large scale dataset, presenting the 
technical solutions that were adopted to store and process data. 
2.1 The Scanner Data Dataset


The periodic acquisition of scanner data by Istat is regulated by an agreement
between Istat and the Italian association of modern distribution, representing the
main chains of modern retail trade. According to the agreement, Istat's access to scanner
data is mediated by a broker, the Nielsen company. Nielsen sends data to
Istat on a monthly basis by uploading the data files to a dedicated web portal.


Currently, Istat has received data for 4 complete years related to 37 provinces, for a
total of 1.4 billion records. Each record represents the weekly sum of turnover and
quantity for a GTIN (Global Trade Item Number, formerly EAN code) sold during
the week in a single store. The current provisioning of data includes 1470 stores, while
the final sample will be composed of 2100 stores. Classification tables are also
provided for mapping GTINs to the ECOICOP classification and determining the Elementary
Aggregate (EA) they belong to. The data consist exclusively of grocery products, with 1579
EAs and 232,000 GTINs represented. Finally, the lists of stores and GTINs, integrated
with additional information (descriptions, geographic location, etc.), are available.


A novel hybrid data architecture has been set up for storing scanner data. The
architecture is composed of a traditional Relational Database Management System
(RDBMS) and a Big Data Processing Platform (BDP). The RDBMS stores only current-year
data and handles the cleaning and pre-processing of the acquired data, while the BDP
offloads the database, storing the whole historical dataset. The motivation for the use
of such an architecture is explained in detail in the next section.


2.2 The Problem with Size 


The size of the scanner data dataset is not common for Istat, being one of the
largest in absolute terms hosted in the institute. We experienced several issues
as a consequence of this. Before setting up the BDP, we first loaded the datasets into
the RDBMS. Besides the table with the raw data, a number of artefacts were
produced afterwards, including indexes, views and temporary tables used to store
intermediate results of processing and analysis. As a result, the size of the whole
tablespace exceeded 500 GB, which is the size beyond which some database
administration tasks (like backups) start to become problematic.


We also experienced problems when analysing and processing data. The time required
for analytic queries (that is, those that aggregate a large number of records in
order to compute distributions and totals) became unpredictable for
queries spanning the whole dataset. In general, data access was slow, making it
difficult to set up and execute smooth analytics processes and pre-processing
operations such as the computation of indices or the cleaning of data. Moreover,
researchers, for obvious reasons, could not operate with their familiar, desktop-based
tools and were forced to sample portions of the dataset and/or work on partial
views (e.g. a single province/market at a time).


It was clear from the early phases of the project that, while the RDBMS is still 
necessary for transactional operations on data, and performs well on datasets that are 
bounded in size, it is not suited to amass indefinite, fast-growing quantities of records, 
that should be analysed as a whole. All these issues motivated the need for an 
additional, and different, technology, to complement RDBMS in our data architecture. 


2.3 Big Data Technology 


By “big data tool” we mean a technological artefact specifically designed to
cope with the features that differentiate so-called Big Data from traditional data, such
as the size of datasets, the speed at which they are updated and the possible inclusion
of non-structured content. Big Data tools are widely used in today's data-heavy
industry, where data can be produced in the order of terabytes per day. Although this
order of magnitude is far from the requirements of statistical institutes, the growing
attention towards the acquisition of new data sources involves size-related issues for
which Big Data tools might represent a solution, as we discuss in the remainder of this
section.


Big Data tools are based on the notion of distributed computing. The idea is to have
clusters of interconnected machines working as a whole with the purpose of storing
and processing data: data is spread over different nodes in the cluster and accessed in
parallel by each machine. The de facto standard for distributed computing is the
open-source platform called Hadoop. Hadoop handles distribution by transparently
managing inter-node communication. Users are unaware of parallelism, and data files
can be accessed as in a standard file system.

A Hadoop cluster can reach virtually unlimited scalability, growing in terms of
space and processing power simply by adding new nodes to the cluster. The possibility
of using non-specialized hardware makes this solution the easiest and most economical
way to support large-scale computation.


A Hadoop-based system typically includes different components that constitute an
ecosystem for storing and analysing large-scale data. A distributed file system
component (namely, HDFS) stores data, both structured and unstructured,
organized in directories. A directory can then be wrapped in a table-like metadata
structure, allowing analysts to query data using the familiar SQL language.
Different tools can be used to analyse the data stored on HDFS. Each tool can access
the same copy of the data, making it possible to pick the tool best suited for a given
operation without having to make specific data extractions first. Two of the tools
included in our Hadoop-based BDP were mostly used in the context of this project:

• Spark: a framework for developing programs that are executed in parallel
over the Hadoop cluster.

• Massively Parallel Processing (MPP) database: enables real-time
querying capabilities on data stored on HDFS by exploiting parallelism and
query optimizations.


2.4 Advantages of Big Data Tools 


We tested our Big Data Platform by using Spark and the MPP database to implement
the simulations described in Section 3. Their performance has been compared against
the standard RDBMS used at Istat. We stress that this is by no means intended
to be a product benchmark (for this reason we decided to omit product names), which
would have required putting all products in the best possible conditions in order to
guarantee a fair and meaningful comparison. Our intention was to assess performance
under the normal conditions in which we use our production systems, with no specific
optimizations. The scanner data project was the perfect occasion to test the behaviour
of both platforms, because the dataset is large enough to represent a significant
testbed. Moreover, we could keep exact copies of the data tables, so that we could run
the same analytical queries without modifications on both systems over tables containing
the same data.


Table 1: Execution time of example operations, BDP vs. RDBMS

Operation                                       BDP               RDBMS
Analytic query (cumulative turnover per item    62 sec. (MPP)     699 sec.
over a store/month/year)
Processing of indices at EA/store level over    55 min. (Spark)   7 to 9 hours
one year


Examples of execution times are shown in Table 1. We can see that the BDP
achieves significant improvements in performance, with a dramatic impact on
the whole flow of the analytic activity: getting fast responses from a reactive data
platform encourages researchers to ask more questions and to target more ambitious
goals. Moreover, it makes it possible to operate on the whole dataset with no need to
sample or to create extracts. On the other hand, it is important to point out that Hadoop
is not meant to replace traditional data management solutions (like RDBMSs or
statistical software), because its focus on large datasets introduces several trade-offs
when applied to small/medium datasets.


3 Analyzing Scanner Data with Big Data Tools 


In this section we present some of the results of an extensive analysis that we carried
out in order to support the decisions of the production sector about the method that
will be implemented in production for computing the indices based on scanner data.
In particular, following the Eurostat recommendation for processing scanner data [2],
we considered the two approaches suggested for selecting the set of GTINs that will
contribute to the base indices at EA level:

• Static: follows the traditional method of considering a fixed basket of
representative GTINs, to be followed over the entire year. Items that
disappear during the year must be replaced.

• Dynamic: selects a representative set of items, considering the matching
GTINs for each pair of consecutive months that have a turnover above a certain
threshold.


The comparison is targeted at assessing the feasibility of the two methods in the 
perspective of their use in production. In particular, since replacements in the static 
approach might be problematic if occurring for a large number of items, we are 
interested in evaluating the actual number of non-matching items per month, to 
understand if and how it could be possible to implement replacements. At the same 
time, we want to understand the behavior of the dynamic approach in terms of the 
global amount and monthly trend of matching items. 


We simulated the complete process to compute EA indices in the two cases, strictly 
following the indications in the guidelines from Eurostat. In both cases data are first 
filtered to trim out extreme prices; monthly prices are then computed as the 
arithmetic average of weekly prices, weighted by quantity, using only the weeks 
that do not overlap two months. Additional cleaning is then performed by removing 
all the products that belong to EAs that contribute less than 0.5% of the 
turnover of their segment or that do not link to the ECOICOP classification. 
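
The monthly price step can be sketched as follows. This is a minimal illustration, not the production code: it assumes each weekly record carries the week's start date and that a week spans seven consecutive days, and the names `week_within_month` and `monthly_price` are hypothetical.

```python
from datetime import date, timedelta

def week_within_month(week_start: date) -> bool:
    """True when the 7-day week starting at week_start lies in a single calendar month."""
    week_end = week_start + timedelta(days=6)
    return (week_start.year, week_start.month) == (week_end.year, week_end.month)

def monthly_price(weekly_records):
    """Quantity-weighted arithmetic mean of weekly prices, per (year, month).

    weekly_records: iterable of (week_start, price, quantity) for one GTIN in one store.
    Weeks that overlap two months are skipped (assumed week definition: 7 days).
    """
    totals = {}  # (year, month) -> [sum of price * quantity, sum of quantity]
    for week_start, price, quantity in weekly_records:
        if quantity <= 0 or not week_within_month(week_start):
            continue
        acc = totals.setdefault((week_start.year, week_start.month), [0.0, 0.0])
        acc[0] += price * quantity
        acc[1] += quantity
    return {key: pq / q for key, (pq, q) in totals.items()}

records = [
    (date(2016, 1, 4), 2.00, 10),    # week fully inside January
    (date(2016, 1, 11), 3.00, 30),   # week fully inside January
    (date(2016, 2, 29), 9.99, 100),  # week runs into March, so it is skipped
]
prices = monthly_price(records)
print(prices)
```

In this toy input the January price is (2.00 × 10 + 3.00 × 30) / 40, and the week starting on 29 February contributes to no month.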


3.1 Static Approach 


Implementing the static approach consists in selecting the GTINs that contribute to 
the index. Following the guidelines, we considered the set of GTINs that, within each 
store, together account for 80% of the turnover of each EA. We also considered an 
extraction with a 50% threshold. The analysis is carried out on 2016 data, using the 
total turnover in 2015 to identify the cut-off thresholds in each store/EA pair. The 
initial number of items in the selected set is 5,127,075 for the 80% threshold and 
2,014,320 for the 50% threshold, selected from a total of 11,987,554 items relative to 
December 2015. 
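
A cumulative-turnover cut-off of this kind could be sketched as below. The function name `select_gtins` and the toy figures are hypothetical, and the sketch assumes that the item that crosses the threshold is itself included, a detail the text does not specify.

```python
from collections import defaultdict

def select_gtins(turnover, threshold=0.80):
    """Pick, within each (store, EA) pair, the best-selling GTINs until their
    cumulative reference-year turnover reaches `threshold` of the pair's total.

    turnover: dict mapping (store, ea, gtin) -> reference-year turnover.
    """
    groups = defaultdict(list)
    for (store, ea, gtin), value in turnover.items():
        groups[(store, ea)].append((value, gtin))

    selected = set()
    for (store, ea), items in groups.items():
        total = sum(value for value, _ in items)
        running = 0.0
        for value, gtin in sorted(items, reverse=True):  # best sellers first
            if running >= threshold * total:
                break
            selected.add((store, ea, gtin))
            running += value
    return selected

toy = {
    ("S1", "pasta", "A"): 50.0,
    ("S1", "pasta", "B"): 35.0,
    ("S1", "pasta", "C"): 10.0,
    ("S1", "pasta", "D"): 5.0,
}
basket_80 = select_gtins(toy, 0.80)
basket_50 = select_gtins(toy, 0.50)
print(sorted(basket_80), sorted(basket_50))
```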


Figure 1(a) shows the total number of non-matching items per month over all EAs, in 
the cases of 80% (dark grey) and 50% (light grey) thresholds. This is the number of 
items that should be replaced for computing the index. This number is clearly too high 
to be dealt with by a human operator, so some automatic method should be used. 


Figure 1: Static approach — evaluation of unmatched items 


IT Solutions for Analyzing Large-Scale Statistical Datasets: Scanner Data for CPI 435 


[Figure 1 panels: (a) total number of unmatched items per month; (b) percentage of 
unmatched items per month.] 

Figure 1(b) depicts the percentage of non-matching items in the two cases, 
highlighting an evident fluctuation in item presence over the year. The skewness 
in the global distribution might be due to the influence of seasonal products and, in 
some cases, to data missing for entire stores (data not sent by the store to Nielsen), 
which should be estimated. 


3.2 Dynamic Approach 


The dynamic approach consists in the computation of micro-indices by comparing 
prices of products over two subsequent months. GTINs are selected on a monthly 
basis by applying a low-sale filter, defined in the guidelines (page 34), that should 
ensure that the selected products contribute a proportion of turnover in the 50%-80% 
range. Moreover, indices over 400% are removed. After filtering, the global number 
of items treated over the entire year is 122,273,323 (out of an initial amount of 
238,295,930 items). 
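
The matching step can be illustrated with a short sketch. It does not reproduce the low-sale filter of the guidelines; it only shows matching GTINs across two consecutive months and dropping extreme price relatives. Reading "indices over 400%" as a price ratio above 4 is an assumption, and `micro_indices` is a hypothetical name.

```python
def micro_indices(prev_prices, curr_prices, max_ratio=4.0):
    """Price relatives for GTINs present in both months.

    prev_prices, curr_prices: dict mapping gtin -> monthly price.
    Relatives above max_ratio (assumed meaning of "indices over 400%") are dropped.
    """
    matched = prev_prices.keys() & curr_prices.keys()
    relatives = {g: curr_prices[g] / prev_prices[g]
                 for g in matched if prev_prices[g] > 0}
    return {g: r for g, r in relatives.items() if r <= max_ratio}

previous = {"A": 1.0, "B": 2.0, "C": 0.5}
current = {"A": 1.1, "B": 10.0, "D": 3.0}
kept = micro_indices(previous, current)
print(kept)
```

Here C and D are unmatched, B is matched but removed because its relative is 5, and only A contributes.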


Figure 2(a) shows the average coverage of turnover per EA against the average 
percentage of selected products in the EA, broken down by ECOICOP segment. The size 
of the bubbles is proportional to the total turnover generated by the segment. The 
graph shows that the empirical filtering rule suggested in the guidelines actually 
removes the expected portion of turnover, with around half of the products in the EA 
contributing at least 70% of the turnover in most cases. Smaller EAs 
are naturally more concentrated around a small number of products, so fewer products 
are filtered out in proportion but the remaining ones contribute almost the entire 
turnover. Figure 2(b) shows the percentage of matched GTINs over the year (light 
grey) and the corresponding percentage of turnover (dark grey). Please note that, for 
the sake of simplicity, non-matching GTINs were not estimated or imputed in this phase 
of the analysis. An estimation of temporarily missing GTINs would ensure an even 
higher number of matched GTINs and an improvement in index stability. 


Figure 2: dynamic approach — evaluation of matched items 


4 Discussion and Conclusions 


The analysis carried out over the entire scanner data dataset allowed us to derive some 
significant insights. Firstly, the overall number of replacements required in the static 
approach makes this method feasible only if automatic replacements are considered. 
However, the realization of such an algorithm seems a rather complex task at the 
moment. Secondly, the dynamic approach showed a surprising stability in the portion 
of items matching for each pair of consecutive months, once the less relevant EAs and 
the least sold items were filtered out. Further analyses were carried out but were not 
included for lack of space, while others are planned and will be part of future work. 


In general we can say that the use of Big Data tools proved its benefits in terms of 
enhanced analytical capabilities when facing a challenging dataset such as the 
scanner data one. Using a SQL dialect for accessing data allowed a smooth 
transition to the new platform for trained database users and IT-oriented statistical 
analysts. The integration of our BDP with common statistical tools is nevertheless a 
necessary step in order to involve a broader user base of statisticians. 


Model-based Clustering with Sparse Covariance 
Matrices 


Model-based Clustering con Matrici di Covarianza 
Sparse 


Michael Fop, Thomas Brendan Murphy and Luca Scrucca 


Abstract We introduce mixtures of Gaussian covariance graph models for model- 
based clustering with sparse covariance matrices. The framework allows a parsimo- 
nious model-based clustering of the data, where clusters are characterized by sparse 
covariance matrices and the associated dependence structures are represented by 
graphs. The graphical models pose a set of pairwise independence restrictions on 
the covariance matrices, resulting in sparsity and a flexible model for the joint dis- 
tribution of the variables. The model is estimated employing a penalised likelihood 
approach, whose maximisation is carried out using a genetic algorithm embedded 
in a structural-EM. The method is naturally extended to allow for Bayesian regular- 
ization in the case of high-dimensional data. 


Abstract In this work we introduce a mixture of Gaussian graphical models 
for parametric clustering with sparse covariance matrices. The proposed model 
allows a cluster analysis of the data in which the groups are characterized 
by sparse covariance matrices and the dependence structures among the variables are 
represented by graphs. The graphical model setting makes it possible to define 
independence constraints among the variables, yielding parsimonious models and a 
flexible representation of the joint distributions of the variables. A penalised 
maximum likelihood approach is used for model estimation. The maximisation 
is carried out by a genetic algorithm embedded in a structural EM 
algorithm. The proposed model extends easily to the analysis of 
high-dimensional data through Bayesian regularization methods. 


Key words: Graphical models, genetic algorithms, model-based clustering, pe- 
nalised likelihood, sparse covariance matrix. 


1 Introduction 


Model-based clustering assumes that the data arise from a finite mixture of Gaussian 
distributions where each mixture component is associated with a cluster. In the 
model, the component means and covariance matrices define the characteristics of 
the clusters. Model complexity is driven by the covariance terms, and the number of 
parameters to be estimated grows quadratically with the number of variables in the 
data. To attain parsimony, a variety of methods have been proposed in the literature; 
see for example [2, 5, 1]. However, all rely on matrix decompositions and none of them 
places sparsity directly on the entries of the covariance matrices. 

Moreover, the model does not explicitly consider that some variables may be 
independent of each other in a given cluster. In fact, independence is usually obtained 
by considering mixture components with diagonal covariance matrices, with all the 
variables independent. Although parsimonious, this may not be a realistic assumption, 
because in many applications only some of the variables may be independent 
and the association structures may vary across the clusters. 


2 Model framework 


Gaussian graphical models provide a framework for estimating multivariate Normal 
distributions with sparse covariance matrices. In this section we incorporate this 
framework into model-based clustering to obtain a clustering of the data with sparse 
covariance matrices and groups with different dependence patterns. 


2.1 Gaussian covariance graph model 


A graph $G$ is a mathematical object denoted as the pair $G = (V, E)$, where $V$ is the 
set of vertices (or nodes) and $E \subseteq V \times V$ is the set of edges. 

Let us consider a graph whose vertex set represents a set of random variables 
$\{X_1, \dots, X_j, \dots, X_V\}$ distributed according to a multivariate Gaussian 
distribution. A covariance graph model encodes marginal dependencies among the 
variables expressed in the covariance matrix $\Sigma$ [6]. In fact, a missing edge 
between two nodes in the graph corresponds to marginal independence between the 
related variables. For a pair of variables $(X_h, X_j)$ the following properties hold: 


$$(h, j) \notin E \iff X_h \perp\!\!\!\perp X_j \iff \sigma_{hj} = 0,$$


with $\sigma_{hj}$ the covariance term in $\Sigma$. Therefore a given graph $G$ poses a 
set of linear constraints on the off-diagonal entries of $\Sigma$, allowing the 
estimation of a sparse covariance matrix with the requirement of being positive 
definite and belonging to $\mathcal{C}^+(G)$, the cone of positive definite matrices 
induced by $G$. 
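
The two requirements, zeros where edges are missing and positive definiteness, can be illustrated with a small sketch. The helper names are hypothetical and the matrices are toy values; the point is that zeroing entries of a positive definite matrix does not by itself keep it positive definite, which is why membership in the cone has to be imposed.

```python
import math

def apply_graph(S, edges):
    """Zero the off-diagonal entries of S whose index pair is not an edge of the graph."""
    E = {frozenset(e) for e in edges}
    n = len(S)
    return [[S[h][j] if h == j or frozenset((h, j)) in E else 0.0
             for j in range(n)] for h in range(n)]

def is_positive_definite(M):
    """Cholesky test for a symmetric matrix M (symmetry is assumed, not checked)."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot means M is not positive definite
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

edges = [(0, 1), (1, 2)]  # the edge (0, 2) is missing: variables 0 and 2 are independent

def equicorrelated(r):
    return [[1.0, r, r], [r, 1.0, r], [r, r, 1.0]]

sparse_ok = apply_graph(equicorrelated(0.5), edges)
sparse_bad = apply_graph(equicorrelated(0.9), edges)
print(is_positive_definite(sparse_ok), is_positive_definite(sparse_bad))
```

With correlation 0.5 the constrained matrix stays positive definite, while with 0.9 it does not, even though both unconstrained matrices are positive definite.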


2.2 Mixture of Gaussian covariance graph models 


Let $X$ be the $N \times V$ data matrix, where each observation $x_i$ is a realization 
of a $V$-dimensional vector of random variables $\{X_1, \dots, X_j, \dots, X_V\}$. In a 
mixture of Gaussian covariance graph models we assume that the density of each data 
point is defined as follows: 


$$f(x_i \mid \Psi, \mathcal{G}) = \sum_{k=1}^{K} \pi_k \, \phi(x_i \mid \mu_k, \Sigma_k, G_k) \quad \text{with } \Sigma_k \in \mathcal{C}^+(G_k), \qquad (1)$$


where $\pi_k$ are the mixing proportions, $\phi$ is the multivariate Gaussian density, 
$\mathcal{G} = \{G_1, \dots, G_k, \dots, G_K\}$ is the set of graphs of the mixture 
components and $\mathcal{C}^+(G_k)$ is the cone of positive definite matrices induced 
by graph $G_k = (V, E_k)$. In the model, the mixture components are characterized by 
different edge sets $E_k$, thus we allow the variables to have distinct association 
patterns across the clusters. 

For the model in (1) we consider the penalised log-likelihood: 


$$\ell(X; \Psi, \mathcal{G}) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \phi(x_i \mid \mu_k, \Sigma_k) - \sum_{k=1}^{K} p(|E_k|), \qquad (2)$$


where the penalty term $\sum_{k=1}^{K} p(|E_k|)$ is a function of the number of edges 
$|E_k|$ in each graph, i.e. the number of covariance parameters for each mixture 
component. Different penalisation terms correspond to different modelling strategies 
for the association among the variables, and prior information can be included. 
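
As an illustration of the penalised log-likelihood in (2), the sketch below evaluates it with NumPy for given parameters. The function names are hypothetical, the penalty is passed in as an arbitrary function of the edge count because the text does not fix one, and the toy penalty used in the example is an assumption.

```python
import numpy as np

def gaussian_logpdf(X, mu, Sigma):
    """Row-wise log density of a multivariate normal N(mu, Sigma)."""
    V = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (V * np.log(2 * np.pi) + logdet + maha)

def penalised_loglik(X, weights, means, covs, edge_counts, penalty):
    """Mixture log-likelihood minus the sum over components of penalty(|E_k|)."""
    log_terms = np.stack(
        [np.log(w) + gaussian_logpdf(X, m, S) for w, m, S in zip(weights, means, covs)],
        axis=1,
    )
    top = log_terms.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    loglik = np.sum(top[:, 0] + np.log(np.exp(log_terms - top).sum(axis=1)))
    return loglik - sum(penalty(e) for e in edge_counts)

# One standard bivariate component evaluated at the origin.
X = np.zeros((1, 2))
unpenalised = penalised_loglik(X, [1.0], [np.zeros(2)], [np.eye(2)], [0], lambda e: 2.0 * e)
with_edge = penalised_loglik(X, [1.0], [np.zeros(2)], [np.eye(2)], [1], lambda e: 2.0 * e)
print(unpenalised, with_edge)
```

The two values differ only by the penalty attached to the single extra edge.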


2.3 Model estimation 


For a fixed number of components $K$, model estimation corresponds to estimation of 
the mixture parameters $\Psi$ and selection of the graph structures $\mathcal{G}$. To 
accomplish the task we introduce a structural EM algorithm (S-EM), which allows us to 
estimate parameters and infer the graph configuration in the presence of missing data [3]. 

The S-EM algorithm maximises a penalised version of the log-likelihood, where 
the penalisation term is some function of the edge set of the graph. The M step 
alternates the maximisation of the expected complete data log-likelihood with respect 
to parameters and graph configuration. In particular, the maximisation of this quantity 
with respect to the component graph structures is a combinatorial problem. Indeed, 
at each step of the algorithm the penalisation term makes it possible to define a 
scoring rule for different edge sets and to search for the best graph. To tackle the 
problem we use a genetic algorithm [7] where the fitness function is the penalised 
expected complete data log-likelihood and the edge sets are expressed as binary 
strings. The algorithm makes use of standard genetic operators, such as mutation and 
crossover, and allows for an efficient search through the graph space. 
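
The binary-string representation and the two genetic operators can be sketched as follows. This is only an illustration of the encoding: the function names are hypothetical, one-point crossover and independent bit flips are assumed variants of the operators, and the fitness evaluation (the penalised expected complete data log-likelihood) is left out.

```python
import random
from itertools import combinations

def edge_slots(V):
    """All V(V-1)/2 possible undirected edges among V nodes, in a fixed order."""
    return list(combinations(range(V), 2))

def encode(edges, V):
    """Edge set -> binary string with one bit per possible edge."""
    present = {tuple(sorted(e)) for e in edges}
    return [1 if slot in present else 0 for slot in edge_slots(V)]

def decode(bits, V):
    """Binary string -> sorted list of edges."""
    return [slot for slot, b in zip(edge_slots(V), bits) if b]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def crossover(a, b, rng):
    """One-point crossover: head of parent a joined to tail of parent b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

rng = random.Random(0)
E = [(0, 1), (2, 3), (1, 3)]
bits = encode(E, 4)
child = crossover(bits, mutate(bits, 0.2, rng), rng)
print(bits, decode(child, 4))
```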


2.4 Bayesian regularization 


In the case of high-dimensional data or when the sample size is relatively small 
compared to the number of variables, singularities may arise in the estimation of the 
covariance matrix. Following [4] we propose a Bayesian regularization approach 
where the maximum (penalised) likelihood estimator is replaced by a maximum 
(penalised) a posteriori estimator. Standard prior distributions are assumed for the 
parameters and hyperparameters are selected appropriately for clustering. 


3 Discussion 


We introduced a framework for model-based clustering with sparse covariance ma- 
trices where clusters can be characterized by different association patterns among 
the variables. 

The method is applied to simulated data and benchmark clustering datasets, 
where it is shown to give good clustering performance and insights about the re- 
lationships between the variables. 


Quantile Regression for Functional Data 
Regressione Quantile per Dati Funzionali 


Maria Franco-Villoria and Marian Scott 


Abstract Quantile regression allows estimation of the relationship between re- 
sponse and explanatory variables at any percentile of the distribution of the re- 
sponse (conditioned on the explanatory variables). We extend quantile regression 
to the functional case, rewriting the quantile regression model as a generalized ad- 
ditive model where both the functional covariates and the functional coefficients are 
parametrized in terms of B-splines. Parameter estimation is done using a penalized 
iteratively reweighted least squares (PIRLS) algorithm. We evaluate the performance 
of the model by means of a simulation study. 

Abstract Quantile regression makes it possible to estimate the relationship between a 
response variable and covariates at any percentile of the distribution 
of the response (conditional on the covariates). In this work quantile regression is 
extended to the case of functional data, rewriting the regression model as a 
generalized additive model in which both the functional covariates and the functional 
coefficients are parametrized through B-splines. Parameter estimation 
is carried out with an iteratively reweighted least squares algorithm. The 
performance of the model is evaluated in a simulation study. 


Key words: B-splines, functional coefficient, generalized additive model, PIRLS 


1 Introduction 


Linear regression aims to estimate the expected value of the response 
variable and its dependence on a set of explanatory variables. However, there 
might be situations in which the mean of the distribution is not informative, e.g. if 
one is interested in the high values of a given variable. Quantile regression [7] 
allows estimation of the relationship between response and explanatory variables at 
any percentile of the distribution of the response (conditioned on the explanatory 
variables). As a result, rates of change in the response variable can be estimated for 
the whole distribution and not only for the mean. Quantile regression is widely used 
and has been applied in different fields such as finance, medicine or the environment. 
On the other hand, the growing dimensionality of available data has stimulated 
the development of models for functional data [10], where the observed data are 
considered as a discrete realization of an underlying smooth function, i.e. a curve. 

In this work, we extend quantile regression to the functional case. However, the 
definition of a quantile in a functional data setting is not straightforward given the 
lack of a distribution function. An interesting proposal to define functional quantiles 
is that of Lopez-Pintado and Romo [8], who propose to order the curves based on 
their depth, where the deepest curve would correspond to the median. Quantile 
regression for functional data is a relatively new area of research that has only been 
explored in recent years, hence the available literature is very limited. Cardot, 
Crambes and Sarda [1, 2] and Kato [6] have extended functional linear regression models 
to the case of quantile regression considering functional covariates and a scalar 
response. A non-parametric version was proposed by Dabo-Niang and Laksaci [5], while 
Crambes, Gannoun and Henchiri [3, 4] use support vector machine methods for fitting 
quantile regression models where the covariates are functional and the response is scalar. 

Regression models for functional data need to be addressed differently depending 
on whether the response variable is scalar or functional. In Section 2 we discuss how 
the quantile regression coefficients can be estimated when the response variable is 
scalar, while in Section 3 we present preliminary results from a simulation study. 
Extension to the functional response case is briefly discussed in Section 4. 


2 The Model 
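For $\tau \in (0, 1)$ fixed, we consider the quantile regression model: 

$$Q_Y(\tau \mid x(t)) = \alpha + \int \beta(t)\, x(t)\, dt = \alpha + \langle \beta, x \rangle$$

where $Y$ is a scalar response variable, $x(t)$ is a functional covariate, 
$Q_Y(\tau \mid x(t))$ is the $100\tau$th quantile of the distribution of $Y \mid x(t)$ 
and $\langle \cdot, \cdot \rangle$ is the inner product. The parameters $\alpha$, 
$\beta(t)$ can be estimated by minimizing the objective function: 

$$R(\alpha, \beta) = \sum_{i=1}^{n} \rho_\tau\left( y_i - \alpha - \int \beta(t)\, x_i(t)\, dt \right)$$

where $\rho_\tau(u) = u(\tau - I(u < 0))$ is the check function and $I$ is an indicator 
function. 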
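
The role of the check function can be seen in the simplest case with no covariate, where minimizing the summed check loss over a constant returns a sample quantile. The sketch below is illustrative only; `check_loss` and `best_constant` are hypothetical names and the search is restricted to the observed values.

```python
def check_loss(u, tau):
    """Check function u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def best_constant(y, tau):
    """Observed value c that minimises the sum of check losses of y_i - c."""
    return min(y, key=lambda c: sum(check_loss(v - c, tau) for v in y))

y = list(range(1, 12))            # the values 1, 2, ..., 11
median = best_constant(y, 0.5)
q90 = best_constant(y, 0.9)
print(median, q90)
```

Positive residuals are weighted by $\tau$ and negative ones by $1 - \tau$, which is what pulls the minimizer towards the upper part of the sample when $\tau$ is large.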
The quantile regression model can be rewritten as a generalized additive model 
where both the functional covariate and the functional coefficients are parametrized 
in terms of B-spline basis functions. The objective function to be minimized is a 
sum of asymmetrically weighted absolute residuals; in the quantile regression 
literature, linear programming methods are used to estimate the unknown regression 
parameters. Instead, we approximate the absolute residuals with the squared residuals 
and adjust the weights accordingly. This way the regression coefficients can be 
estimated using a penalized iteratively reweighted least squares (PIRLS) algorithm. 
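
The reweighting idea can be sketched for an ordinary linear quantile regression. This is not the authors' PIRLS implementation: it uses a scalar covariate instead of a B-spline expansion, the optional penalty matrix `P` is a hypothetical stand-in for the smoothing penalty, and the small constant `eps` that guards against division by zero is an assumption.

```python
import numpy as np

def quantile_irls(X, y, tau, P=None, n_iter=200, eps=1e-6, tol=1e-8):
    """Iteratively reweighted least squares for linear quantile regression.

    Each absolute residual |r| is replaced by r**2 / |r|, so that the check loss
    becomes a weighted sum of squares with weights (tau or 1 - tau) / |r|.
    X is the (n, p) design matrix, including the intercept column.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares starting value
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), eps)
        XtW = X.T * w                            # scales the i-th observation by w_i
        lhs = XtW @ X if P is None else XtW @ X + P
        new_beta = np.linalg.solve(lhs, XtW @ y)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)
X = np.column_stack([np.ones(n), x])

beta_median = quantile_irls(X, y, 0.5)
beta_q90 = quantile_irls(X, y, 0.9)
share_negative = float(np.mean(y - X @ beta_q90 < 0))
print(beta_median, beta_q90, share_negative)
```

A quick sanity check for such a fit is the share of negative residuals, which should be close to $\tau$.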


3 Preliminary results 


We evaluate the performance of the estimating algorithm by means of a simulation 
study, where we consider different sample sizes, two levels of noise and various 
forms of complexity for the functional coefficient. We evaluate the performance at 
four different quantiles $q_{0.2}$, $q_{0.5}$, $q_{0.7}$ and $q_{0.9}$. The simulated 
data are built as 


$$y_i = \alpha + \int \beta(t)\, x_i(t)\, dt + \varepsilon_i.$$


The functional covariate is $x(t) = \sum_{j} \xi_j B_j(t)$, where $B_j(t)$ are B-spline 
basis functions evaluated at $t \in T = [0, 1] \subset \mathbb{R}$, $j = 1, \dots, 10$, 
and the spline coefficients $\xi_j \sim N(0, 1)$. The random errors $\varepsilon_i$ are 
simulated from a normal distribution $N(q_\tau, \sigma^2)$ with $q_\tau$ the 
$100\tau$th quantile of the $N(0, \sigma^2)$; values of $\sigma$ were chosen to ensure a 
signal to noise ratio of 2 and 4. 

To evaluate how well $\beta(t)$ is estimated, we consider two indicators: the distance 
($L^2$ norm) between the simulated and estimated coefficient and the proportion of 
negative and positive residuals. Results from a preliminary simulation study suggest 
that the method performs well; when the sample size is small ($n = 50$) distance 
values range from 0.01 to 0.45 when the functional coefficient is linear and from 0.1 
to 1.2 when the functional coefficient is non-linear. Results improve with increasing 
sample size and the closer we get to the median, as expected. The percentages of 
positive and negative residuals were very close to the expected $100(1 - \tau)\%$ and 
$100\tau\%$ respectively. Convergence was reached after 4 to 29 iterations. 


4 Discussion and Future Work 


In this work, we propose a quantile regression model when the covariates are func- 
tional and the response is scalar. Preliminary results from a first simulation study 
suggest good performance for a range of different quantiles. The model can be easily 
extended to incorporate more covariates keeping the computational cost low thanks 
to the use of sparse matrix computation. 

We are currently working on the case of a quantile regression model where the 
response is functional too. In this case, the residuals themselves are functional data 
and working out the weights is not as straightforward as in the scalar response case. 
A possibility would be to consider the distance from the zero curve as a proxy for 
the size of each residual, while the sign of the residual could be worked out using 
some sort of curve ordering technique such as band depth or a more recent proposal 
based on epigraphs and hypographs [9]. 

In particular, quantile regression for functional data could prove useful in solving 
the problem of uncertainty evaluation of a predicted curve, where the 2.5% and 
97.5% quantiles could be used to build a functional confidence band. 


Three-way compositional data: a multi-stage 
trilinear decomposition algorithm 


Dati composizionali a tre vie: un algoritmo per la 
decomposizione trilineare a più stadi 


Gallo M., Simonacci V., and Di Palma M.A. 


Abstract The CANDECOMP/PARAFAC model is an extension of bilinear PCA 
and has been designed to model three-way data by preserving their multidimen- 
sional configuration. The Alternating Least Squares (ALS) procedure is the pre- 
ferred estimating algorithm for this model because it guarantees stable results. It 
can, however, be slow at converging and sensitive to collinearity and over-factoring. 
Dealing with these issues is even more pressing when data are compositional and 
thus collinear by definition. In this talk the solution proposed is based on a multi- 
stage approach. Here parameters are optimized with procedures that work better for 
collinearity and over-factoring, namely ATLD and SWATLD, and then results are 
refined with ALS. 

Abstract The CANDECOMP/PARAFAC model is a generalization of PCA to three-way 
arrays. The most widely used procedure for estimating the parameters of this model 
is Alternating Least Squares (ALS). This algorithm is the most widely used because it 
guarantees stable results; however, it also has drawbacks, such as being 
slow and sensitive to multicollinearity and over-factoring. Dealing with 
these problems becomes particularly demanding when the data are multicollinear 
by construction, as in the case of compositional data. As a solution 
to these problems, this work proposes a multi-stage approach in which the 
parameters are first optimized with procedures that work better in the presence 
of collinearity and over-factoring, namely ATLD and SWATLD, and the 
final results are then obtained with ALS. 


1 Introduction 


Observations over a set of variables can be recorded on different occasions, such 
as times or locations. These data present a tridimensional structure, and the only way 
to obtain a low-rank approximation without confounding the variability of two 
dimensions is to use multi-linear techniques such as the CANDECOMP/PARAFAC 
(CP) model [2, 10]. This model estimates three separate sets of parameters, one for 
each mode of the analysis, and is thus highly complex; the search for innovative 
ways to improve its efficiency without compromising the accuracy of results is 
therefore of great relevance. 

The most widely used algorithm for the CP model is currently PARAFAC-ALS 
(ALS), since it grants stable results, a least-squares solution and a 
monotonically decreasing fit. It does, however, present some problematic aspects 
such as slow convergence and sensitivity to over-factoring, multicollinearity 
and factor collinearity. These issues are even more significant when dealing with 
data that present particular challenges such as Compositional Data (CoDa) [1, 11]. 
CoDa can be defined as positive vectors with a purely multicollinear structure, as 
their elements describe the parts of a whole and thus only carry relative 
information. 

Given these considerations, in [9] an alternative way to overcome these difficulties 
in a compositional framework is presented. Specifically it is suggested that in order 
to mitigate ALS inefficiencies this procedure can be integrated by adding an initial- 
ization/recovery stage where parameters are optimized through the Self-Weighted 
TriLinear Decomposition (SWATLD). In this manner a novel two-stage procedure 
is implemented (INT-1). 

SWATLD, proposed in [3], was chosen amongst other alternatives because it can be 
seen as complementary to ALS: its strengths are fast convergence and 
robustness to over-factoring and collinearity, while its weaknesses are that it finds 
a solution in a non-least-squares sense and gives unstable results [5, 12, 14, 16]. 

INT-1 appears to work quite well in the simulations presented in the cited article; 
however, several ways to improve its performance and reliability were suggested 
there as future developments but have not yet been verified. In this perspective the 
purpose of this contribution is to explore the possibility of improving the 
performance of INT-1 by trying to answer two open questions. 

The first question is the consequence of a methodological comparison with [15], 
where it is argued that the Alternating TriLinear Decomposition (ATLD) proposed 
in [13] works better than SWATLD with respect to random initialization, 
multicollinearity and speed. We thus wondered if ATLD could be considered as an 
initialization step. To resolve this, a second multi-stage procedure (INT-2) was 
devised, this time with three stages, to see if adding an initial ATLD step could 
improve performance. 

The second problem concerns the identification of an optimal transition point from 
one stage to the next, i.e. are there optimal convergence criteria capable of making 
INT-1 and INT-2 perform at their best? This question is addressed in a simulation 
study on stage transition parameters. 

Once these two aspects are dealt with, a new comparative study can be carried out 
to verify three points of interest: 1) how INT-1 and INT-2 perform with respect to 
ALS for compositional data; 2) which of INT-1 and INT-2 is the better alternative; 
and 3) how data characteristics such as noise level and factor collinearity 
influence results. 


2 Compositions in a CP model 


Let us consider a three-way array $V$ $(I \times J \times K)$ with generic positive 
element $v_{ijk}$, where $i = 1, \dots, I$, $j = 1, \dots, J$, and $k = 1, \dots, K$. If 
its row vectors $v_{ik} = [v_{i1k}, \dots, v_{iJk}]$ present a biased covariance 
structure due to an implicit or explicit sum constraint 
$v_{i1k} + \dots + v_{iJk} = \kappa$, where $\kappa$ is a positive constant, the array 
has a compositional structure and should be processed with compositional methodology. 

This bounded covariance imposes a purely multicollinear structure on the data since 
the elements of a compositional vector are not linearly independent. As a 
consequence the covariance matrix for each of the $K$ frontal slabs $V_k$ 
$(I \times J)$ of the array $V$ will be singular. 

From a geometric standpoint these row vectors are forced into a subspace of 
$\mathbb{R}^J$ known as the simplex and defined as: 


$$S^J = \{ (v_{i1k}, \dots, v_{iJk}) : v_{i1k} \ge 0, \dots, v_{iJk} \ge 0;\ v_{i1k} + \dots + v_{iJk} = \kappa \} \qquad (1)$$


To operate within this subspace a non-Euclidean set of rules, known as Aitchison 
geometry, is used to identify a linear vector space [11]. Compositional vectors can, 
however, be converted into Euclidean space coordinates by using log-ratio transfor- 
mations: pairwise, centered, additive [1] or isometric [4]. 

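For the purpose of this contribution we will only be referring to centered log-ratio 
(clr) coordinates which can be expressed as: 


$$z_{ik} = \mathrm{clr}(v_{ik}) = \left[ \ln \frac{v_{i1k}}{\sqrt[J]{\prod_{j=1}^{J} v_{ijk}}}, \dots, \ln \frac{v_{iJk}}{\sqrt[J]{\prod_{j=1}^{J} v_{ijk}}} \right] \qquad (2)$$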


By applying this transformation the tridimensional array of compositions $V$ can 
easily be changed into an array of clr-coordinates $Z$ so that standard algorithms 
can be applied as long as results are interpreted in compositional terms [6, 8]. It is 
important to note that clr-coordinates, by providing an $S^J$ to $\mathbb{R}^J$ 
projection, do not remove the collinearity problem. 
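
A small sketch makes the last remark concrete: clr-coordinates of any composition sum to zero, so the transformed parts remain linearly dependent. The function name `clr` and the example composition are illustrative choices.

```python
import math

def clr(v):
    """Centered log-ratio coordinates: log of each part over the geometric mean."""
    logs = [math.log(x) for x in v]
    mean_log = sum(logs) / len(logs)  # logarithm of the geometric mean of the parts
    return [value - mean_log for value in logs]

composition = [0.1, 0.2, 0.7]                 # parts of a whole, summing to 1
z = clr(composition)
z_rescaled = clr([10.0, 20.0, 70.0])          # same composition closed to 100
print(z, sum(z))
```

The coordinates are also unchanged when the composition is rescaled, which reflects the fact that only relative information is carried by the parts.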

An array of clr-coordinates $Z$ can be decomposed with the CP model in three sets of 
parameters, one for each mode of the analysis. Let $F$ be the number of considered 
factors; using a slab-wise notation we can write: 


$$Z_k = A D_k B^\top + E_k, \qquad k = 1, \dots, K \qquad (3)$$


where $A$ $(I \times F)$ and $B$ $(J \times F)$ are the loading matrices for the first 
and second mode, respectively; $D_k$ is a diagonal matrix containing the $k$th row of 
$C$ $(K \times F)$, the loading matrix of the third mode; $Z_k$ $(I \times J)$ is the 
$k$th frontal slab of $Z$; and $E_k$ $(I \times J)$ is the corresponding frontal slab 
of the error array $E$. 
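
The slab-wise notation in (3) can be checked numerically. The sketch below builds the structural part of each frontal slab from randomly generated loading matrices (toy values, no error term) and compares it with the equivalent element-wise sum over factors; `cp_slab` and `cp_array` are hypothetical names.

```python
import numpy as np

def cp_slab(A, B, C, k):
    """Structural part of the k-th frontal slab: A diag(c_k) B^T."""
    return A @ np.diag(C[k]) @ B.T

def cp_array(A, B, C):
    """Stack the K frontal slabs into an (I, J, K) array."""
    return np.stack([cp_slab(A, B, C, k) for k in range(C.shape[0])], axis=2)

rng = np.random.default_rng(0)
I, J, K, F = 4, 3, 5, 2
A = rng.normal(size=(I, F))
B = rng.normal(size=(J, F))
C = rng.normal(size=(K, F))

Z_model = cp_array(A, B, C)
Z_elementwise = np.einsum("if,jf,kf->ijk", A, B, C)  # sum over f of a_if * b_jf * c_kf
print(Z_model.shape, np.allclose(Z_model, Z_elementwise))
```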


3 Estimating procedures 


Different algorithms can be used to fit the model to the data. The most common 
one is ALS. This is an iterative procedure where sets of parameters are estimated 
in three successive least-squares steps. On the other hand, ATLD and SWATLD are 
also three-step iterative procedures but do not follow a least-squares approach and 
are characterized by the use of three distinct objective functions, one for each mode, 
which prioritize the trilinear structure of the data. 
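
One ALS sweep can be sketched with NumPy as three least-squares updates, one per mode. This is a generic textbook formulation written for illustration, not the implementation used by the authors; the helper names, the random starting values and the noiseless toy array are all assumptions.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product, shape (rows(P) * rows(Q), F)."""
    F = P.shape[1]
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, F)

def unfold(Z, mode):
    """Matricize the three-way array with the chosen mode along the rows."""
    return np.moveaxis(Z, mode, 0).reshape(Z.shape[mode], -1)

def als_step(Z, A, B, C):
    """Update A, B and C in turn, each by least squares given the other two."""
    A = unfold(Z, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
    B = unfold(Z, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
    C = unfold(Z, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

rng = np.random.default_rng(1)
I, J, K, F = 6, 5, 4, 2
A0, B0, C0 = rng.normal(size=(I, F)), rng.normal(size=(J, F)), rng.normal(size=(K, F))
Z = np.einsum("if,jf,kf->ijk", A0, B0, C0)   # noiseless array with F = 2 factors

A, B, C = rng.normal(size=(I, F)), rng.normal(size=(J, F)), rng.normal(size=(K, F))
sse = []
for _ in range(50):
    A, B, C = als_step(Z, A, B, C)
    sse.append(float(np.sum((Z - np.einsum("if,jf,kf->ijk", A, B, C)) ** 2)))
print(sse[0], sse[-1])
```

Because every update solves a least-squares problem exactly, the residual sum of squares cannot increase from one sweep to the next, which is the monotonicity property attributed to ALS in the text.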

The described algorithms all present some qualities and weaknesses directly derived 
from the properties of their loss functions. ATLD is the fastest at converging and it 
is robust to over-factoring, collinearity and initial values. It does not, however, find 
a least-square solution, it may not monotonically decrease, it is sensitive to noise 
and often does not converge properly. 

On the opposite end there is ALS, the slowest at converging, stable in its results, ca- 
pable of finding a solution in the least square sense but sensitive to collinearity and 
over-factoring. SWATLD occupies a middle ground: it is more stable than ATLD 
but not quite as reliable as ALS, it is pretty fast at converging but slower than ATLD 
while still robust to over-factoring and collinearity. In addition it may still not have 
a monotonically decreasing fit and not converge to a least square solution. 

Given these considerations, two multi-stage procedures were devised to try to 
maximize the advantages and counter-balance the inefficiencies of these algorithms. 
INT-1 is structured in the following manner: in a first stage (recovery stage) 
parameters are estimated by SWATLD with the purpose of identifying the correct 
underlying components in case of over-factoring, dealing better with 
multicollinearity and speeding up the procedure; subsequently, in a second stage 
(refinement stage), the solution is adjusted through ALS steps to obtain a 
least-squares solution and avoid SWATLD instabilities. 

INT-2 presents a similar outline but also includes an additional ATLD initialization 
stage, which could help when dealing with multicollinearity and bad initial values. 
A schematic overview of the procedures is displayed in Fig. 1. In both cases stage 
transition can be user defined in terms of relative fit and number of iterations. 
However, these transition parameters can hugely hinder or improve the performance of 
both INT-1 and INT-2, thus ideal values will be identified through a threshold 
simulation study. It is also important to note that for both algorithms at least one 
iteration has to be performed at each stage. Once optimal parameters are found, they 
will be included as defining elements of the procedures. 
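
The stage-transition logic can be sketched generically. The code below is only a schematic driver: the name `run_stages`, the definition of relative fit change and the toy update rules standing in for the actual ATLD, SWATLD and ALS steps are all assumptions made for illustration.

```python
def run_stages(state, stages, loss):
    """Run the stages in order; each stage is a tuple (step, rel_tol, max_iter).

    A stage ends when the relative change in loss falls below rel_tol or after
    max_iter iterations; at least one iteration is always performed per stage.
    """
    history = []
    for step, rel_tol, max_iter in stages:
        previous = loss(state)
        for _ in range(max(1, max_iter)):
            state = step(state)
            current = loss(state)
            history.append(current)
            rel_change = abs(previous - current) / max(abs(previous), 1e-300)
            previous = current
            if rel_change < rel_tol:
                break
    return state, history

def halve(x):
    return 0.5 * x   # toy stand-in for a fast first-stage update

def shrink(x):
    return 0.9 * x   # toy stand-in for a slower refinement update

def squared(x):
    return x * x     # toy loss

final_state, history = run_stages(8.0, [(halve, 1e-1, 3), (shrink, 1e-2, 5)], squared)
print(final_state, len(history))
```

In an actual two-stage or three-stage procedure the toy updates would be replaced by one sweep of the corresponding algorithm, and the tolerances and iteration caps would be the transition parameters under study.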


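[Figure: flow diagram of the two procedures. INT-1: recovery stage (SWATLD), which 
stops at the recovery convergence parameter, followed by the refinement stage (ALS). 
INT-2: initialization stage (ATLD), which stops at the initialization convergence 
parameter, followed by the recovery stage (SWATLD), which stops at the recovery 
convergence parameter, and then by the refinement stage (ALS).] 


Fig. 1 Multi-stage procedures outline 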


4 Conclusion 


With the purpose of further developing the findings presented in [9], where a
two-stage SWATLD-ALS algorithm was introduced, this contribution proposes two
important advancements: 1) devising a three-step INT-2 procedure to see whether
initializing with ATLD grants additional benefits; and 2) setting up a study to
identify ideal stage transition parameters for both INT-1 and INT-2.
To test the proposed modifications, a comparative simulation study between INT-1,
INT-2 and ALS will then be carried out in a compositional setting.
Given that only partial results are available at this stage, we can make the
following considerations. In terms of ideal transition parameters, there is a
trade-off between accuracy and efficiency: stricter relative fit convergence
criteria ($10^{-3}$ or $10^{-4}$) render the algorithms more efficient but less
stable. Looser criteria ($10^{-1}$ or $10^{-2}$), on the other hand, are slower
but more reliable, and for this reason generally preferable.
In terms of comparative results, we expect INT-1 and INT-2 (set up with ideal
parameters) to perform similarly to ALS in terms of reliability for correct factor
estimation, and better in case of over-factoring, while being far more efficient.
INT-1 will probably be slightly more reliable but a little slower than INT-2.
Complete and in-depth results will be discussed during the presentation.


Nonparametric shared frailty model for
classification of survival data


Models with a shared random term for the
classification of survival data


Francesca Gasperoni, Francesca Ieva, Anna Maria Paganoni, Chris Jackson and 
Linda Sharples 


Abstract In this work, we propose an innovative model for fitting grouped survival
data and for detecting a second level of clusters among groups. In order to achieve
this goal, we start from a classical semiparametric Cox model and we add a
nonparametric discrete random term as a multiplicative factor. This research
question arose from a project about healthcare management of Regione Lombardia. We
analyze a rich administrative database, in which a wide range of patient
information is collected (e.g. dates of hospitalizations, death, comorbidities,
procedures etc.). In this framework, patients are the statistical units and
hospitals are the known groups. Through the application of this new model, we are
able to detect hidden populations among hospitals and we provide a clustering tool
for survival data.

Abstract (translated from the Italian) In this work we propose an innovative model
for grouped survival data, with the aim of identifying a possible classification of
the groups. To achieve this goal, we start from a semiparametric Cox model and add
a multiplicative random term. This research question arose from a Regione Lombardia
project focused on the management of hospital structures. We analyzed an
administrative database in which a variety of patient information was collected
(dates of hospitalization, dates of death, comorbidities, procedures etc.). In this
case, patients are the statistical units and hospitals are the known groups.
Through the application of this innovative model, we are able to identify latent
populations among hospitals and we provide a clustering method for survival data.


Key words: Hierarchical clustering, time-to-event data, administrative databases. 


1 Introduction 


Classical survival models implicitly assume that statistical units are homogeneous,
meaning that all individuals have the same risk of experiencing the event of
interest. However, this is a strong assumption. Indeed, it is almost impossible to
know all relevant risk factors, because of time or, often, financial constraints.
This omission of covariates leads to unobserved heterogeneity, which means
different risks for different subjects. A possible way to model this heterogeneity
consists in including an additional random term, commonly known as a frailty term.
A frailty model is a random effects model for time-to-event data, where the random
effect (frailty) has a multiplicative effect on the baseline hazard [13]. Usually,
univariate frailty models are chosen for capturing the heterogeneity due to
unobserved covariates, as introduced by Vaupel et al. [12].

Another, completely different, use of the random term addresses the lack of
independence among observations when the statistical units are clustered in
several groups. Indeed, these frailty models provide an interesting way to capture
the dependence between observations within a group and/or the heterogeneity among
groups. These models, also known as shared frailty models, are hazard models with a
multiplicative random term common to all members of the same group. This random
term is the realization of a specified distribution, which implies the need to fix
a priori the distribution that best fits the data.

The most common distributions for a frailty term are Gamma and log-Normal,
and this is a consequence of the simplicity of the computations and the
availability of software. Indeed, for the Gamma frailty term, Therneau and
Grambsch [10] describe a penalized partial likelihood approach which accelerates
the classical computation made through the EM algorithm. This method is implemented
in the function coxph, which is part of the survival package [9, 10]. The
log-Normal distributed frailty was introduced for the first time by McGilchrist
[8], and it is implemented efficiently in a specific package named coxme [11].

However, these are just two options for the distribution choice, since there is
quite a wide variety of other distributions, introduced by Hougaard [6, 7] and by
Aalen [1]. An in-depth overview of these frailty models is given by Duchateau and
Janssen in their book [4], where theoretical computations and examples are also
reported for the less used Positive Stable, Power Variance and Compound Poisson
distributions.

A key issue, then, is how to choose among all these distributions. No guideline is
currently available for answering this question, so it is common to see analyses
in which the performances of different frailty distributions are compared.

The aim of this work is to introduce a nonparametric frailty term, which allows
us to avoid the choice of a specific shape for the frailty and, at the same time,
to cluster observations. Indeed, the introduction of a nonparametric random term
enables us to detect a possible latent structure in the dataset. The proposal of
this built-in clustering technique is the most innovative aspect of this paper
since, to the best of our knowledge, there is no existing way of clustering
time-to-event data.

Moreover, only a few works have dealt with a discrete frailty term. For example,
Guo and Rodriguez [5] proposed a nonparametric frailty term with two masses for
estimating the death hazard rate for children in Guatemala; Caroni et al. [2]
proposed three different discrete distributions for the frailty term (Poisson,
Negative Binomial and Geometric); while Dos Santos et al. [3] proposed a comparison
between a model with parametric baseline and nonparametric frailty and a model with
nonparametric baseline and parametric frailty. In all these papers the baseline is
parametric, usually a Weibull. None of them proposed a model combining the two
critical points: a nonparametric baseline and a nonparametric frailty term.

In this work, we want to introduce for the first time a new model for the hazard 
rate which considers a semi-parametric Cox model together with a group specific 
nonparametric frailty term. 


2 Models and Methods 


The aim of this paper is to investigate a possible dependence among known
groups by detecting a second clustering level. Groups are the first clustering
level, while hidden populations are the second clustering level. In order to
achieve this goal, we propose a semiparametric Cox model with a nonparametric
discrete frailty term.

$T_{ij}$ is the random variable which models the time-to-event for the $i$-th
statistical unit in the $j$-th group, where $i \in \{1,\dots,n_j\}$ and $n_j$ is the
number of statistical units collected in group $j$. $C_{ij}$ represents the
censoring time and $\delta_{ij} = \mathbb{1}_{\{T_{ij} < C_{ij}\}}$ is the status
index. $X_{ij}$ is a $p$-dimensional vector of covariates, $\beta$ is the parameter
vector and $\lambda_0(t_{ij})$ represents the nonparametric baseline hazard. Then,
we suppose the existence of $K$ hidden populations, each one characterized by a
frailty level $w_k$, and we introduce a new random variable $Z_{jk}$ which is equal
to 1 if the $j$-th cluster belongs to the $k$-th population,
$Z_{jk} \sim Be(\pi_k)$. To sum up, we have the following extra parameters in the
model: $K$, the number of hidden populations, with $K < J$;
$\pi = [\pi_1, \pi_2, \dots, \pi_K]$, the probability vector such that
$\sum_{k=1}^{K} \pi_k = 1$; and $w = [w_1, w_2, \dots, w_K]$, the frailty levels
vector. In this case the hazard rate becomes:

$$\lambda(t_{ij}) = \lambda_0(t_{ij})\, w_k \exp(X_{ij}^{T}\beta), \qquad i \in \{1,\dots,n_j\},\ j \in \{1,\dots,J\},\ k \in \{1,\dots,K\}$$


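Starting from the hazard rate, we are able to explicitly write the full likelihood
of our model as:

$$L_{full} = \prod_{k=1}^{K} \prod_{j=1}^{J} \prod_{i=1}^{n_j} \left\{ \left[\lambda_0(t_{ij})\, w_k \exp(X_{ij}^{T}\beta)\right]^{\delta_{ij}} \exp\left[-\Lambda_0(t_{ij})\, w_k \exp(X_{ij}^{T}\beta)\right] \right\}^{z_{jk}}$$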


Finally, we compute the parameter estimates of the proposed model through
a suitable Expectation-Maximization (EM) algorithm. In particular, $\beta$, the
cumulative baseline hazard $\Lambda_0(t_{ij})$, $\pi$ and the ratios $w_k/w_1$, with
$k = 2,\dots,K$, are estimated inside the EM. We estimate the ratios of frailty
levels instead of the single frailty levels because of an identifiability issue. By
contrast, the value of $K$ is estimated outside the EM algorithm, through a
backward selection method.
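
To make the role of the latent memberships concrete, the following sketch shows what an E-step could look like for a model of this kind. It is a standard finite-mixture computation written under stated assumptions, not the authors' algorithm: it assumes that the mixing proportions $\pi_k$ weight each population, and that for each group $j$ one has already computed the number of events $D_j = \sum_i \delta_{ij}$ and the cumulative risk $A_j = \sum_i \Lambda_0(t_{ij}) \exp(X_{ij}^{T}\beta)$. Under these assumptions, the posterior probability that group $j$ belongs to population $k$ is proportional to $\pi_k\, w_k^{D_j} \exp(-w_k A_j)$. The function name `e_step` and its arguments are hypothetical.

```python
import numpy as np


def e_step(pi, w, n_events, cum_risk):
    """Posterior membership probabilities for each group (rows) and each
    latent population (columns), assuming the posterior is proportional to
    pi_k * w_k**D_j * exp(-w_k * A_j).

    pi       : (K,) mixing proportions
    w        : (K,) frailty levels
    n_events : (J,) number of observed events per group, D_j
    cum_risk : (J,) sum over units of Lambda_0(t_ij) * exp(x_ij' beta), A_j
    """
    pi = np.asarray(pi, dtype=float)
    w = np.asarray(w, dtype=float)
    d = np.asarray(n_events, dtype=float)[:, None]
    a = np.asarray(cum_risk, dtype=float)[:, None]
    # Work on the log scale and subtract the row maximum for stability.
    log_post = np.log(pi)[None, :] + d * np.log(w)[None, :] - a * w[None, :]
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


# Toy example: two populations, two groups with the same cumulative risk but
# very different event counts.
posterior = e_step(pi=[0.5, 0.5], w=[1.0, 2.0],
                   n_events=[10, 2], cum_risk=[5.0, 5.0])
```

In this toy example the group with many events relative to its cumulative risk is assigned mostly to the higher-frailty population, and the other group to the lower-frailty one.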


3 Application to clinical administrative database 


The dataset is extracted from the clinical administrative database of Regione Lom- 
bardia, and concerns patients that have been hospitalized with a diagnosis of chronic 
heart failure between 2005 and 2012. We have a total of 164,384 events (admission 
in, discharge from hospital and death) for a total of 43,998 patients, identified by 
anonymous codes. The total number of hospitals is 307. 

For this specific application, we focus our attention only on the second admission
to hospital, omitting those patients who died before the second admission. The
final cohort is thus composed of 24,075 patients, and 291 hospitals are recorded. We
apply our algorithm to the selected cohort, considering hospitals as known groups,
and we include three covariates in the model: gender, age and comorbidity index.
Finally, we detect four different populations, with the following vector of
proportions: $\hat{\pi} = [0.291, 0.273, 0.274, 0.162]$. The estimates of the
regression parameters are 0.26 for gender, 0.04 for age and 0.37 for the
comorbidity index. As for the frailty ratios, we obtain that $w_2/w_1$ is 1.60,
$w_3/w_1$ is 1.99 and $w_4/w_1$ is 2.35. We can therefore conclude that being
hospitalized in a hospital that belongs to the second population, rather than in
one that belongs to the first population, leads to a higher instantaneous risk of
being readmitted. Similar observations can be made for populations 3 and 4.
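
As a reading aid, the short sketch below turns the reported estimates into hazard ratios. It relies only on the multiplicative form of the hazard: a regression coefficient $b$ corresponds to a hazard ratio $\exp(b)$ per unit increase of the covariate, and the ratio of two frailty levels is itself a hazard ratio between populations, all else being equal. The variable and function names are hypothetical.

```python
import math

# Estimates reported in the text.
beta = {"gender": 0.26, "age": 0.04, "comorbidity index": 0.37}
# Frailty levels relative to population 1 (w_k / w_1).
frailty_ratio = {1: 1.0, 2: 1.60, 3: 1.99, 4: 2.35}

# Hazard ratio per unit increase of each covariate: exp(beta).
hazard_ratio = {name: math.exp(b) for name, b in beta.items()}


def relative_frailty(k, ref):
    """Hazard ratio between a hospital in population k and one in
    population ref, for patients with identical covariates."""
    return frailty_ratio[k] / frailty_ratio[ref]


for name, hr in hazard_ratio.items():
    print(f"{name}: hazard ratio {hr:.2f}")
print(f"population 4 vs population 2: {relative_frailty(4, 2):.2f}")
```

For instance, the frailty ratio of 1.60 means that, for the same covariates, the instantaneous readmission risk in a population 2 hospital is 60% higher than in a population 1 hospital.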

In order to visualize the latent structures, we plot the Kaplan-Meier estimates of
the split cohort, see Fig. 1. It is immediately apparent that the survival curves
linked to different populations are well separated from one another (not even the
confidence intervals overlap). It is also important to notice the order of the
curves: the risk increases from the first population to the fourth one, as expected
from the estimated frailty ratios.


4 Conclusions


To the best of our knowledge, the idea of detecting a second level of clustering
structure through a nonparametric discrete frailty term has never been investigated
in the survival analysis literature. Moreover, there is no available software that
implements a discrete frailty term. Finally, the application to a clinical
administrative database is very powerful, since it should have a great impact on
healthcare management policies.


[Figure: Kaplan-Meier survival curves over time (horizontal axis from 0 to 2500)
for the four populations. Legend (population, frailty level, proportion):
Population 1: 0.53, 0.291
Population 2: 0.849, 0.274
Population 3: 1.056, 0.274
Population 4: 1.246, 0.162]

Fig. 1 Kaplan-Meier estimates computed for the four populations identified by our algorithm.


Clustering landmark-based shapes using
Information Geometry tools


Information Geometry for the cluster analysis
of landmark-based shapes


Stefano A. Gattone and Angela De Sanctis 


Abstract In this talk we describe a method for clustering shape configurations
in two dimensions. Variation in the shape space is obtained by introducing
deformations carrying individual landmarks from one to another. The framework,
provided by Information Geometry, is the following. A shape is represented by
a probability distribution. Then, a Riemannian metric is defined on the shape
space, and the length of the geodesics with respect to this metric is used to
measure differences in shape.

Abstract (translated from the Italian) In this talk we describe a method for
carrying out cluster analysis of shapes in two dimensions using Information
Geometry. Variability in the shape space is handled through deformations from one
landmark to another. The framework provided by Information Geometry is the
following. A shape is represented by a probability distribution. A Riemannian
metric is defined on the shape space, and the length of the geodesics computed with
respect to this metric is used to measure the differences between shapes.


Key words: Information Geometry, Geodesics, Fisher-Rao distance, Wasserstein 
distance 


1 Introduction 


Shape clustering is of interest in various fields such as geometric morphometrics,
computer vision and medical imaging. In the clustering of shapes it is important to
select an appropriate measure of distance among observations. The aim of this
talk is to model and cluster shape configurations in two dimensions using
Information Geometry tools. Information Geometry combines geometry and statistics
(Amari and Nagaoka, 2000; Murray and Rice, 1984) and it can be used in Shape
Analysis (Dryden and Mardia, 1998) to describe mathematically patterns from
complex systems and their changes in time.

We consider objects whose shapes are based on landmarks (Bookstein, 1991;
Cootes et al., 1995; Kendall, 1984). The objects can be obtained from medical
imaging procedures, from curves in space described by manually or automatically
assigned feature points, or from a discrete sampling of the object contours.

Since the shape space is invariant under similarity transformations, that is
translations, rotations and scaling, a Euclidean distance function on such a space
is not really meaningful. In order to apply standard clustering algorithms to
planar shapes, the Euclidean metric has to be replaced by the metric of the shape
space. Examples are provided in Amaral et al. (2010) and Stoyan and Stoyan (1990),
where the Procrustes distance was integrated in standard clustering algorithms such
as k-means. Similarly, Lele and Richtsmeier (2001) applied standard hierarchical
or k-means clustering using dissimilarity measures based on the inter-landmark
distances. In a model-based clustering framework, Huang et al. (2016) and Kume and
Welling (2010) developed a mixture model of offset-normal shape distributions.

Statistical manifolds are the objects of study in Information Geometry. They are 
families of probability density functions with their local coordinates defined by the 
model parameters. Rao proved that the Fisher information matrix induces a Rieman- 
nian metric on a statistical manifold. Geodesics with respect to this metric can be 
used to measure differences in shape. Applications of geodesics to shape clustering 
techniques are provided, in a landmark-free context, by Srivastava et al. (2005) and 
Mio et al. (2007). 

With the aim of clustering shapes, we first describe each landmark using a
bivariate Gaussian model, where the means are the landmark coordinates while the
variances reflect the variability across a family of patterns. Next, we define
distances between objects associated with different Riemannian metrics. These
distances are induced by the geodesics of the metrics (geodesic distances). In
general, computing the geodesic distance requires numerical solutions. We focus
instead on cases where analytical expressions are available: the Fisher-Rao (Costa
et al., 2015) and the Wasserstein (Takatsu, 2011) metrics for Gaussian
distributions.


2 The method 


Suppose we are given a planar shape configuration, $C$, consisting of a fixed number
$K$ of labeled landmarks

$$C = \{\mu_1, \mu_2, \dots, \mu_K\}$$

with generic element $\mu_k = \{\mu_{k1}, \mu_{k2}\}$ for $k = 1,\dots,K$. Following
De Sanctis and Gattone (2016) and Peter and Rangarajan (2009), the $k$-th landmark
may be represented by a bivariate Gaussian density as follows:

$$f(x \mid \mu_k, \Sigma_k) = (2\pi)^{-1} |\Sigma_k|^{-1/2} \exp\left\{-\frac{1}{2}(x-\mu_k)^{T} \Sigma_k^{-1} (x-\mu_k)\right\} \qquad (1)$$

with $x$ being a generic 2-dimensional vector and $\Sigma_k$ given by

$$\Sigma_k = \mathrm{diag}(\sigma_{k1}^2, \sigma_{k2}^2) \qquad (2)$$

where $\{\sigma_{k1}^2, \sigma_{k2}^2\}$ is the vector of the variances of the $k$-th
landmark coordinates, for $k = 1,\dots,K$. The variances capture uncertainties that
arise in landmark placement and/or the natural variability across a population of
shapes. Equation (1) represents the $k$-th landmark coordinates on a 4-dimensional
manifold, say $\theta_k = (\mu_k, \sigma_k)$. The space of landmarks can be
parameterized through the $\theta$'s identifying them, so that two shapes $S$ and
$S'$ can be defined as follows: $S = (\theta_1,\dots,\theta_K)$ and
$S' = (\theta'_1,\dots,\theta'_K)$. For every $k$, let $\gamma_k(t)$ with
$t \in [0,1]$ be a path on the manifold such that $\gamma_k(0) = \theta_k$ and
$\gamma_k(1) = \theta'_k$. From differential geometry we know that a given
Riemannian metric $g$ induces an inner product $\langle \cdot,\cdot \rangle_g$ on the
tangent space of the manifold such that the length of $\gamma_k(t)$ is defined as
follows

$$\ell(\gamma_k) = \int_0^1 \sqrt{\langle \dot{\gamma}_k(t), \dot{\gamma}_k(t) \rangle_g}\; dt \qquad (3)$$

The distance between the $k$-th landmarks of the two shapes is given by the minimum
length of the trajectory $\gamma_k(t)$.

Finally we use the matrix of pairwise distances between landmarks as distance of 
the two shapes S and S’. 

In the space of Gaussian distributions, we will consider two different Riemannian
metrics, which in turn induce two types of geodesic distances. One metric is the
Fisher-Rao metric $g_F$, defined by the Fisher information matrix $g$, with generic
$(i,j)$ entry given by

$$g_{ij} = \int f(x \mid \theta)\, \frac{\partial}{\partial \theta_i} \log f(x \mid \theta)\, \frac{\partial}{\partial \theta_j} \log f(x \mid \theta)\, dx$$
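
For Gaussians with diagonal covariance, the Fisher-Rao geodesic distance is available in closed form, which the sketch below illustrates. It uses the known expression for univariate normal distributions and the fact that, with a diagonal covariance, the squared distance is the sum of the squared univariate distances over the coordinates. The function names are hypothetical, standard deviations (not variances) are passed as arguments, and this is not claimed to be the authors' code.

```python
import math


def fisher_rao_univariate(mu1, sd1, mu2, sd2):
    """Closed-form Fisher-Rao distance between N(mu1, sd1^2) and N(mu2, sd2^2)."""
    dm = (mu1 - mu2) / math.sqrt(2.0)
    a = math.hypot(dm, sd1 + sd2)
    b = math.hypot(dm, sd1 - sd2)
    # (a + b) / (a - b) rewritten as (a + b)^2 / (4 sd1 sd2), which avoids
    # a subtraction of nearly equal numbers.
    return math.sqrt(2.0) * math.log((a + b) ** 2 / (4.0 * sd1 * sd2))


def fisher_rao_landmark(mu, sd, mu_prime, sd_prime):
    """Distance between two bivariate Gaussians with diagonal covariance,
    each given by a pair of means and a pair of standard deviations."""
    squared = sum(
        fisher_rao_univariate(m, s, mp, sp) ** 2
        for m, s, mp, sp in zip(mu, sd, mu_prime, sd_prime)
    )
    return math.sqrt(squared)


# Example: the same landmark position with different coordinate variability.
d_example = fisher_rao_landmark((0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (2.0, 1.0))
```

When the means coincide, the univariate distance reduces to $\sqrt{2}\,|\log(\sigma_2/\sigma_1)|$, which gives a quick sanity check.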


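The other metric considered in this talk is the Riemannian metric $g_W$, which
induces the Wasserstein distance (Takatsu, 2011). For Gaussian measures it is given
by

$$d_{g_W}^2(\theta,\theta') = \|\mu - \mu'\|^2 + \mathrm{tr}(\Sigma) + \mathrm{tr}(\Sigma') - 2\,\mathrm{tr}\left(\left(\Sigma^{1/2}\,\Sigma'\,\Sigma^{1/2}\right)^{1/2}\right)$$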
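
A minimal numerical sketch of the Wasserstein distance between two Gaussians follows. It implements the standard closed-form expression for the 2-Wasserstein distance; the helper names are hypothetical, and the matrix square root is computed through an eigendecomposition, which assumes symmetric positive semidefinite inputs.

```python
import numpy as np


def sqrtm_psd(a):
    """Square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T


def wasserstein2_gaussian(mu1, cov1, mu2, cov2):
    """2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    root1 = sqrtm_psd(cov1)
    cross = sqrtm_psd(root1 @ cov2 @ root1)
    squared = (np.sum((mu1 - mu2) ** 2) + np.trace(cov1) + np.trace(cov2)
               - 2.0 * np.trace(cross))
    return float(np.sqrt(max(squared, 0.0)))


# With diagonal covariances the squared distance reduces to
# ||mu - mu'||^2 + sum_i (sigma_i - sigma'_i)^2, here 25 + 1 + 1 = 27.
d_w = wasserstein2_gaussian([0.0, 0.0], np.diag([1.0, 4.0]),
                            [3.0, 4.0], np.diag([4.0, 1.0]))
```

The diagonal case is the one relevant to the landmark model above, and it shows that the Wasserstein distance treats differences in means and in standard deviations on the same Euclidean footing.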


The geodesic distances induced by these two metrics can be used to define a
shape distance. The discriminative power of these shape distances will be evaluated,
in the extended version of this manuscript, in the context of shape clustering on
both simulated and real data sets.


Space and circular time log Gaussian Cox
processes with application to crime event data


Spatial and circular-time log-Gaussian Cox processes
for the study of crime


Alan E. Gelfand and Shinichiro Shirota 


Abstract We view the locations and times of a collection of crime events as a space- 
time point pattern modeled as either a nonhomogeneous Poisson process or a more 
general log Gaussian Cox process. We need to specify a space-time intensity.
Viewing time as circular necessitates valid separable and nonseparable covariance
functions over a bounded spatial region crossed with circular time. Additionally,
crimes are classified by crime type and each crime event is marked by day of the 
year which we convert to day of the week. We present marked point pattern models 
to accommodate such data. Our specifications take the form of hierarchical models 
which we fit within a Bayesian framework. We consider model comparison between 
the nonhomogeneous Poisson process and the log Gaussian Cox process as well as 
separable vs. nonseparable covariance specifications. Our motivating dataset is a 
collection of crime events for the city of San Francisco during the year 2012. 
Abstract (translated from the Italian) In this work we study crime events through
nonhomogeneous Poisson point processes and marked log-Gaussian Cox processes.
These models require specifying the space-time intensity. Moreover, interpreting
time as a circular variable requires specifying separable and nonseparable
covariance functions that are valid on the spatial and circular-time domain. We
present point process models suited to describing these data. We propose a
hierarchical formulation of the model within the Bayesian framework. The data
analyzed concern crime events that occurred in San Francisco in 2012.


Key words: derived covariates; hierarchical model; marked point pattern; Markov
chain Monte Carlo; separable and nonseparable covariance functions; wrapped
circular variables


1 Introduction


The times of crime events can be viewed as circular data. That is, working at the 
scale of a day, we can imagine event times as wrapped around a circle of
circumference 24 hours (which, without loss of generality, can be rescaled to
$[0, 2\pi)$). Furthermore, over a specified number of days, we can view the set of
event times,
consisting of a random number of crimes, as a point pattern on the circle. Suppose, 
additionally, that we attach to each crime event its spatial location over a bounded 
domain. Then, for a bounded spatial region, we have a space-time point pattern over 
this domain, again with time being circular. 

The contribution here is to develop suitable models for such data, motivated by a 
set of crime events for the city of San Francisco in 2012. The challenges we address 
involve (i) clustering in time - event times are not uniformly distributed over the 24 
hour circle; (ii) spatial structure - evidently, some parts of the city have higher inci- 
dence of crime events than others; (iii) crime type - characterization of point pattern 
varies with type of crime so different models are needed for different crime types; 
(iv) incorporating covariate information - we anticipate that introducing suitable 
constructed spatial and temporal covariates will help to explain the observed point 
patterns; (v) the need for spatio-temporal random effects - the constructed spatial 
and temporal covariates will not adequately explain the space-time point patterns; 
(vi) the availability of marks - in addition to a location and a time within the day, 
each event has an associated day of the year which we convert to a day of the week. 
We propose a range of point pattern models to address these issues; fortunately, our 
motivating dataset is rich enough to investigate them. 

We focus on the problem of building a log Gaussian Cox process (LGCP) which 
includes, as a special case, a nonhomogeneous Poisson process (NHPP), over space 
and circular time. We need to build a suitable intensity surface which is driven by a 
realization of a log Gaussian process incorporating a valid covariance function over 
space and time. Typically, time is modeled linearly, leading to a large literature on 
point patterns over bounded time intervals (see, e.g., [1] and [2]). Adding space, [3] 
offer development of a space-time LGCP. [4] consider a space-time process convo- 
lution model for modeling of space time crime events. 

In fact, in this context, it is important to articulate the difference between viewing 
time in a linear manner vs. a circular manner. With linear time there is a past and 
a future. We can condition on the past and predict the future, we can incorporate 
seasonality and trend in time. With circular time, as with angular data in general, we 
only obtain a value once we supply an orientation, e.g., the customary midnight with 
time, although, below, we argue to start the day at 02:00. So, we have no temporal 
ordering of our crime events except within a defined 24 hour window. We are only 
interested in modeling the intensity over space and circular time. For event times 
during a day, wrapping time seems natural. Again, these times only arise given an 
orientation. However, crimes at 23:55 and 00:05 are as temporally close as crimes 
at 23:45 and 23:55. 
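
The point about temporal closeness on the circle can be checked with a few lines of code. The sketch below maps clock times to angles in $[0, 2\pi)$ and measures the shorter arc between them; the function names are hypothetical and midnight is used as the orientation purely for illustration.

```python
import math

MINUTES_PER_DAY = 24 * 60


def time_to_angle(hour, minute=0):
    """Map a clock time to an angle in [0, 2*pi), with midnight at angle 0."""
    return 2.0 * math.pi * (hour * 60 + minute) / MINUTES_PER_DAY


def circular_distance(a, b):
    """Shorter arc length between two angles on the unit circle."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def minutes_apart(t1, t2):
    """Circular distance between two (hour, minute) times, in minutes."""
    d = circular_distance(time_to_angle(*t1), time_to_angle(*t2))
    return d * MINUTES_PER_DAY / (2.0 * math.pi)


across_midnight = minutes_apart((23, 55), (0, 5))
same_evening = minutes_apart((23, 45), (23, 55))
```

Both pairs are ten minutes apart on the circle, whereas a linear treatment of time within a single day would place 23:55 and 00:05 almost 24 hours apart.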

Our data consist of a set of crime events in San Francisco (SF) during the year
2012. Each event has a time of day and a location. In fact, we also have a
classification into crime type and an assignment of each crime to a district,
arising from a suitable partitioning of the city. Lastly, we know the day of the
year for the event, enabling consideration of day of the week effects.


2 The dataset 


Our dataset consists of crime events in the city of San Francisco in 2012. We have
three crime type categories: (1) assault, (2) burglary/robbery, and (3) drug. Each
crime event has time (date, day of week, time of day) and location (latitude
and longitude) information. Spatial coordinates (latitude and longitude) were
transformed into eastings and northings. Each crime event is also classified into a
district. In particular, there are 10 districts in San Francisco: (1) Bayview, (2)
Central, (3) Ingleside, (4) Mission, (5) Northern, (6) Park, (7) Richmond, (8)
Southern, (9) Taraval, (10) Tenderloin (see Figure 1, left panel). Figure 1 (right
panel) shows the counts of crime events for day of week (see footnote 1). Counts
for crime types show different patterns. Assault events happen more on weekends,
but burglary/robbery events happen most on Friday.

Fig. 1 The map of San Francisco (left) and crime counts on each day of week (right)


Figure 2 shows the data by type and by day of the week (3 x 7 plots) in the 
form of ‘rose’ diagrams. This figure reveals differences among crime types and 
also differences across day of the week. For example, drug-related crime events are
observed more often from 5 to 7 p.m., while burglary/robbery crime events are
observed later in the day. Overall, the circular time dependence of crime events is
seen, i.e.,
large counts from evening to late night and small counts from early morning through 
the middle of the day. In the point pattern model construction below, we model each 
crime type separately and, within crime type, incorporate day of week as a mark. 
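
Since the day of the week is taken here to run from 02:00 to 02:00, the mark attached to each event has to be derived with that offset. The sketch below shows one simple way to do this with the standard library; the function name `crime_day_of_week` is hypothetical and the approach (shifting the timestamp back by two hours before reading the weekday) is an assumption about how such a mark could be computed, not the authors' code.

```python
from datetime import datetime, timedelta

DAY_START = timedelta(hours=2)  # the "day" runs from 02:00 to 02:00
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def crime_day_of_week(timestamp):
    """Day-of-week mark for an event, with days starting at 02:00."""
    shifted = timestamp - DAY_START
    return DAY_NAMES[shifted.weekday()]


# 1 January 2012 was a Sunday; an event at 01:30 that morning is attributed
# to Saturday night, while an event at 02:00 already belongs to Sunday.
late_night = crime_day_of_week(datetime(2012, 1, 1, 1, 30))
next_day = crime_day_of_week(datetime(2012, 1, 1, 2, 0))
```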


1 Below, we take day of the week as 02:00 to 02:00. This definition interprets crime events on, e.g.,
Saturday night as including the early hours of Sunday morning.

[Figure: rose diagrams titled "Crime Events in 2012 by Time of day"]

Fig. 2 Histograms of crime events by type and by day of the week


3 Modeling and Theory 


Observations on a circle lead us to the world of directional data, as illustrated in 
Figure 2. Once an orientation has been chosen, the circular observations are spec- 
ified using the angle from the orientation to the corresponding point on the unit 
circle. Here, we are only concerned with point patterns on a circle. For the
nonhomogeneous Poisson process and log Gaussian Cox process models we only need to
specify intensities over $D \times S^1$, where $S^1$ is the unit circle.


3.1 The nonhomogeneous Poisson process (NHPP) and Log 
Gaussian Cox process (LGCP) 


Again, since the crime events are random both in number and in space-time location, 
we think of them as a random point pattern over space and time. We consider the 
two most common models for such a setting: the NHPP and the LGCP. The LGCP 
dates at least to [5]. As a spatial process, it is defined so that the log of the intensity 
is a Gaussian process (GP), i.e., 


$$\log \lambda(s) = X(s)^{T}\beta + Z(s), \qquad Z(s) \sim GP(0, C). \qquad (1)$$

Here, Z(s) is a zero mean stationary, isotropic GP over D with covariance function
C, which provides spatial random effects for the intensity surface, pushing up and 
pulling down the surface, as appropriate. If we remove Z(s) from the log intensity, 
we obtain the associated NHPP. 

Again, we consider a three dimensional Gaussian process with a two dimensional
location and one dimensional circular time. In general, we seek
$Z(s,t) \sim GP(0,C)$, $(s,t) \in \mathbb{R}^2 \times S^1$. We need to specify valid
correlation functions over $\mathbb{R}^2 \times S^1$.

[6] proposes families of circular correlation functions (CCFs) based on truncation
of familiar spatial correlation functions. He shows that completely monotone
functions are strictly positive definite on spheres of any dimension, e.g., the
powered exponential, Matérn, generalized Cauchy, and Dagum families. The example in
[6] which we adopt is the generalized Cauchy family,

$$C_{gc}(u) = (1 + u^{\alpha})^{-\tau/\alpha}, \quad \text{for } u \in [0,\pi],\ \alpha \in (0,1],\ \tau > 0, \qquad (2)$$

where $\tau$ is a shape parameter which does not affect the positive definiteness as
long as $\tau > 0$. This function is positive definite for any dimension if
$\alpha \in (0,1]$. It may be surprising that restriction of familiar spatial
correlation functions to the spherical domain maintains positive definiteness on
the sphere.
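
The positive definiteness claim can be checked numerically on a small example. The sketch below evaluates a generalized Cauchy correlation, written here as $(1 + u^{\alpha})^{-\tau/\alpha}$ with $u$ the circular distance in $[0, \pi]$ and without any additional scale parameter (an assumption made for simplicity), on a grid of 24 equally spaced times, and inspects the smallest eigenvalue of the resulting correlation matrix. The function names are hypothetical.

```python
import numpy as np


def generalized_cauchy(u, alpha, tau):
    """Generalized Cauchy correlation (1 + u**alpha) ** (-tau / alpha)."""
    if not (0.0 < alpha <= 1.0) or tau <= 0.0:
        raise ValueError("requires alpha in (0, 1] and tau > 0")
    u = np.asarray(u, dtype=float)
    return (1.0 + u ** alpha) ** (-tau / alpha)


def circular_distance_matrix(angles):
    """Pairwise shorter-arc distances, in [0, pi], between angles on the circle."""
    a = np.asarray(angles, dtype=float)
    d = np.abs(a[:, None] - a[None, :]) % (2.0 * np.pi)
    return np.minimum(d, 2.0 * np.pi - d)


# One time point per hour of the day.
angles = np.linspace(0.0, 2.0 * np.pi, 24, endpoint=False)
corr = generalized_cauchy(circular_distance_matrix(angles), alpha=0.5, tau=1.0)
smallest_eigenvalue = np.linalg.eigvalsh(corr).min()
```

A nonnegative smallest eigenvalue (up to round-off) is what one expects from a valid correlation function; repeating the check with other admissible values of `alpha` and `tau` is a cheap safeguard before using the matrix in a model.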

In the context of the LGCP model, we need to specify the covariance function 
for the latent Gaussian process Z(s,t). Separable space time covariance functions 
are often adopted due to convenient specification and computational simplification 
[7]. The separable specification arises as a product of a valid space and a valid time 
covariance function, i.e., 


$$C_{s,t}(h,u) = C_s(h)\, C_t(u) \qquad (3)$$


So, we can define a valid space-time covariance function merely by choosing as
$C_s$ any valid covariance function on $\mathbb{R}^2$ and multiplying it by any of
the foregoing valid CCFs. The resulting covariance matrix for a set of $(s,t)$'s
with $N$ $s$'s by $M$ $t$'s will have a Kronecker product form $C_s \otimes C_t$,
where $C_s$ and $C_t$ are $N \times N$ and $M \times M$ covariance matrices.
Simplified inverse, determinant, and Cholesky decomposition result, making the
separable specification computationally efficient and tractable in high dimensional
cases.
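
The computational simplifications brought by the Kronecker structure can be verified directly. The sketch below uses small random symmetric positive definite matrices as stand-ins for $C_s$ and $C_t$ (they are not derived from any covariance function, which is an assumption made only to keep the example short) and checks the standard identities for the inverse, the log-determinant and the Cholesky factor.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_spd(n):
    """Random symmetric positive definite matrix, used as a stand-in covariance."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)


N, M = 4, 3                      # N spatial locations, M time points
C_s, C_t = random_spd(N), random_spd(M)
C = np.kron(C_s, C_t)            # (N*M) x (N*M) separable covariance

# Inverse: only the small factors need to be inverted.
inv_kron = np.kron(np.linalg.inv(C_s), np.linalg.inv(C_t))

# Log-determinant: log|C| = M * log|C_s| + N * log|C_t|.
logdet_kron = M * np.linalg.slogdet(C_s)[1] + N * np.linalg.slogdet(C_t)[1]

# Cholesky factor: the Kronecker product of the small Cholesky factors.
chol_kron = np.kron(np.linalg.cholesky(C_s), np.linalg.cholesky(C_t))

inverse_ok = np.allclose(inv_kron, np.linalg.inv(C))
logdet_ok = np.isclose(logdet_kron, np.linalg.slogdet(C)[1])
cholesky_ok = np.allclose(chol_kron, np.linalg.cholesky(C))
```

In practice the full matrix `C` would never be formed; it is built here only to confirm that the factor-wise computations agree with the direct ones.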

It is evident that the separable covariance specification is restrictive for real
data applications because it precludes space-time interaction of the sort we
mentioned in the Introduction. Various versions of nonseparable covariance
functions have been proposed for the case where space is again $\mathbb{R}^2$ and
time is linear. We need a nonseparable version with circular time. In the full
paper we develop such specifications and fit data using these specifications.
Details are omitted here but, disappointingly, we found essentially no improvement
in model performance for the nonseparable choice.

We employ constructed space and time covariates. For the spatial covariates, we
identify a set of landmarks. These landmarks are referred to as crime attractors and 
are selected from centers of commercial activity, i.e., places with high population 
density, high human exposure. Examples might include malls, market streets, and 
amusement centers. For a given landmark, we employ a directional Gaussian kernel 
function as the distance measure from crime location to landmark. That is, inverse 
distance measures risk; the smaller the distance the larger the risk. To form a tempo- 
ral covariate we need a function whose support is the unit circle. Since crime events 
occur more frequently in the evening and night hours, less in the morning and af- 
ternoon hours, the most elementary constructed covariate which reflects this would 
have two levels. Here, we let 


$$\kappa(t) = \mu\left(1 + \delta\, \mathbf{1}(t \in [4\pi/3, 2\pi))\right). \tag{4}$$


On the 24 hour scale, this choice of $\kappa$ would be interpreted as adopting level $\mu$
for times between 02:00 and 18:00 and level $\mu(1+\delta)$ for times between 18:00 and
02:00 in the morning. $\mu$ and $\delta$ become model parameters; alternative windows could
be explored.
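The two-level covariate can be sketched in a few lines of Python. The sketch assumes that angle 0 on the circle corresponds to 02:00, which is what the 24 hour interpretation above implies for the window $[4\pi/3, 2\pi)$; the function names and the parameter values in the usage lines are hypothetical.

```python
import numpy as np

def hour_to_angle(hour, origin_hour=2.0):
    """Map clock time in hours to an angle in [0, 2*pi), with angle 0 at origin_hour (assumed 02:00)."""
    return 2.0 * np.pi * ((np.asarray(hour, dtype=float) - origin_hour) % 24.0) / 24.0

def kappa(t, mu, delta):
    """Two-level temporal covariate: mu outside the window, mu * (1 + delta) on [4*pi/3, 2*pi)."""
    t = np.asarray(t, dtype=float) % (2.0 * np.pi)
    in_window = t >= 4.0 * np.pi / 3.0
    return mu * (1.0 + delta * in_window)

# Illustrative values only: mu = 2, delta = 0.5
evening = float(kappa(hour_to_angle(20.0), mu=2.0, delta=0.5))   # 20:00 lies in the window
morning = float(kappa(hour_to_angle(10.0), mu=2.0, delta=0.5))   # 10:00 lies outside it
```

An alternative window would only change the threshold inside `kappa`.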

Full model specification, full prior details and full elaboration of the model fitting 
are provided in the full paper. Also, details of the out-of-sample model adequacy and 
comparison leading to preference for the LGCP over the NHPP are provided. 


4 Results 


Table 1 shows the estimation results for the space by circular time LGCP for the
three crime categories in 2012. With day of week specific $\mu_w$ and $\delta_w$, $\mu_0$ and $\delta_0$
are set to their means over the days of week, yielding $\mu_w - \mu_0$ and $\delta_w - \delta_0$
as deviations. See Figure 4 below for inference on the $\delta_w$ across day of the week.
The coefficients of the spatial covariates are significantly positive. In particular, $\beta_1$ and $\beta_2$ for drug
crimes show larger values than those for the other crime types. This result suggests
that drug events are more concentrated around landmarks $L_1$ and $L_2$.

Figure 4 shows the posterior mean and 95% CI of $\sum_i \lambda(s_i^*, t_i^*, w)\,\Delta_{s,t,w}$ against
counts on each day of week. For a given $w$, $\sum_i \lambda(s_i^*, t_i^*, w)\,\Delta_{s,t,w}$ is approximately
the expected number of crime events on day $w$ over the year. The left panel demonstrates
that the posterior mean of $\sum_i \lambda(s_i^*, t_i^*, w)\,\Delta_{s,t,w}$ traces the observed counts on days
of week accurately. The right panel displays the posterior mean and 95% CI of $\delta_w$.
Although the variance of $\delta_w$ is large, this figure shows that $\delta_w$ varies with day of
week; for assault, weekend $\delta$'s are larger. Since all of the $\delta_w$'s are positive, regardless
of day of week or type of crime, we find elevated risk in the evening hours.

Table 1 Estimation results for space by circular time LGCP for the full region with separable
covariance: assault (left), burglary/robbery (middle) and drug (right)


                      Assault                     Burglary/Robbery                    Drug
              Mean   95% CI            IF   Mean   95% CI            IF   Mean   95% CI            IF
$\mu_0$       36.94  [20.62, 60.12]    70   36.18  [21.45, 57.79]    70   32.43  [19.04, 58.32]    69
$\delta_0$    0.342  [0.201, 0.613]    68   0.188  [0.069, 0.424]    69   0.344  [0.137, 0.672]    70
$\beta_1$     1.654  [1.146, 2.023]    69   2.646  [2.038, 3.542]    73   3.874  [2.146, 5.045]    74
$\beta_2$     1.202  [0.614, 1.778]    70   0.470  [0.048, 0.823]    65   3.745  [2.117, 4.399]    71
$\sigma^2$    5.598  [5.064, 6.471]    73   5.756  [5.331, 6.219]    72   8.424  [7.868, 8.984]    67
9             0.011  [0.010, 0.013]    70   0.005  [0.005, 0.007]    70   0.027  [0.025, 0.030]    68
di,           0.137  [0.123, 0.151]    58   0.178  [0.163, 0.196]    59   0.161  [0.140, 0.182]    63
o Ka          0.066  [0.059, 0.072]    61   0.033  [0.028, 0.037]    65   0.231  [0.207, 0.259]    63
od,           0.769  [0.681, 0.864]    61   1.027  [0.887, 1.164]    65   1.362  [1.182, 1.540]    60


Fig. 3 Posterior mean and 95% CI of counts (left: dotted points are observed counts) and $\delta_w$ (right)
on each day of week for the full region: dashed lines are 95% CI
[Figure panels: Assault, Burg/Rob, Drug; horizontal axis: day of week, Sun to Sat]


5 Summary 


We have looked at times and locations of crime events for the city of San Fran- 
cisco. We have argued that these data should be treated as point patterns in space 
and time where time should be treated as circular. We introduced derived spatial 
covariates (using distance from landmarks) and temporal covariates (using day of 
the week). We have looked at NHPP and LGCP models for such data. For the latter, 
we have proposed valid space and circular time Gaussian processes, both separable 
and nonseparable, for use in the LGCP. We have shown that the LGCP outperforms 
the NHPP for the SF crime data. However, strong support for nonseparability is not 
seen.


Blind source separation
Cieco separazione alla fonte 


Abdelghani Ghazdali 


Abstract The paper introduces a new approach to Blind Source Separation (BSS) in
instantaneous mixtures of either independent or dependent source component signals.
The approach is based on the minimization of a criterion between copula densities.
The criterion takes advantage of the copula to model the structure of the dependence
between signal components. Simulation results are presented showing the convergence
and the efficiency of the proposed algorithms.


Abstract La carta introduce una nuova nozione di cieco separazione alla fonte 
in miscele istantanee di segnali indipendenti o dipendenti. Questo approccio si 
basa sulla minimizzazione di un criterio tra densita di copula. Quest'ultimo ap- 
profitta della copula per modellare la struttura della dipendenza tra i componenti 
del segnale. Sono presentati i risultati di simulazione mostrando la convergenza e 
l’efficienza degli algoritmi proposti. 


Key words: Blind source separation; instantaneous mixtures; Copulas; Mutual in- 
formation; divergence between copulas. 


1 Introduction 


The blind source separation problem is a fundamental issue in different fields, such
as signal and image processing, medical data analysis, communications and medical
imaging. BSS aims to recover unknown source signals from a set of
observations that are unknown linear mixtures of the sources. It was
introduced and formulated by Bernard Ans, Jeanny Herault and Christian Jutten [1]
in the 1980s, to describe a biological problem. In order to separate the data set,
different assumptions on the sources have to be made. The most common assumptions
are statistical independence of the sources and the condition that at most
one of the components is Gaussian, which leads to the field of Independent Component
Analysis (ICA), see for instance [2]. Many BSS methods have since been
proposed, using second or higher order statistics [3, 4], maximizing likelihood [5],
minimizing the mutual information [6, 7], minimizing divergence criteria
[8], etc. A good overview of the problem can be found in [9]. Recently, it has
been shown in [10, 11, 12, 13] that, using copulas and without assuming
independence of the sources, the sources can still be determined (up to scale and
permutation indeterminacies), whether their components are independent or dependent.

The rest of this paper is organized as follows: Section 2 gives some definitions
and introduces the BSS problem. In Section 3, we describe our approach. Section
4 illustrates some numerical results. Finally, we conclude the paper and give some
directions for further research.


2 Problem formulation 


BSS can be modeled as follows. Denoting by $A$ the mixing operator, the relationship
between the observations and sources is


$$x(t) := A[s(t)] + b(t), \quad t \in \mathbb{R}, \tag{1}$$


where $x$ is a set of observations, $s$ is a set of unknown sources, and $b$ is an additive
noise. In this paper, we consider the linear BSS model with instantaneous mixtures:
the operator $A$ then corresponds to a matrix with scalar entries, and the additive noise is either
considered as an additional set of sources or reduced by applying some form
of preprocessing [8]. We assume that the number of sources is equal to the number
of observations. The model then becomes


$$x(t) := A\, s(t), \quad \forall t \in \mathbb{R}, \tag{2}$$


where $x \in \mathbb{R}^p$ represents the observed vector, $s \in \mathbb{R}^p$ is the unknown vector of
sources to be estimated, and $A \in \mathbb{R}^{p \times p}$ is the unknown mixing matrix. The goal of BSS is
therefore to estimate the unknown sources $s(t)$ from the set of observed mixtures
$x(t)$. The estimation is performed with no prior information about either the sources
or the mixing process $A$ (i.e., we are not in the Bayesian paradigm). Specific
restrictions are placed on the mixing model and the source signals in order to limit
the generality. The separating system is defined by


$$y(t) := B\, x(t), \quad t \in \mathbb{R}. \tag{3}$$


The vector $y(t) \in \mathbb{R}^p$ is the output signal vector (estimated source vector) and
$B \in \mathbb{R}^{p \times p}$ is called the separating operator. In other words, the problem is to obtain an
estimator $\hat{B}$ close to the ideal solution $A^{-1}$ using only the observations $x(t)$, which
leads to an accurate estimate of the sources $s(t)$:


$$\hat{y}(t) := \hat{B}\, x(t) \simeq s(t). \tag{4}$$


3 The approach 


The discrete version of the original problem (2) reads


$$x(n) := A\, s(n), \quad n = 1,\dots,N. \tag{5}$$


The source signals $s(n)$, $n = 1,\dots,N$, will be considered as $N$ copies of the random
source vector $S$, and then $x(n)$, $y(n) := Bx(n)$, $n = 1,\dots,N$, are, respectively, $N$
copies of the random vectors $X$ and $Y := BX$.

The aim is to reconstruct an estimated source signal $y(t)$ from the denoised observed
signal $x(t)$. It has been shown in [10, 11, 12, 13] that if some
prior information about the copula density of the random source vector $s(t)$ is available, we
can identify both the mixing matrix and the sources uniquely for both independent
and dependent sources. Let $Y := (Y_1,\dots,Y_p)^\top \in \mathbb{R}^p$, $p > 1$, be a random vector with
cumulative distribution function (c.d.f.)


$$F_Y(\cdot) : y \in \mathbb{R}^p \mapsto F_Y(y) := F_Y(y_1,\dots,y_p) = P(Y_1 \le y_1,\dots,Y_p \le y_p) \tag{6}$$

and continuous marginal functions

$$F_{Y_i}(\cdot) : y_i \in \mathbb{R} \mapsto F_{Y_i}(y_i) := P(Y_i \le y_i), \quad \forall i = 1,\dots,p. \tag{7}$$

Writing $f_Y$ for the joint density of $Y$ and $f_{Y_i}$ for its marginal densities, the mutual information of $Y$ is defined by


$$MI(Y) := \int_{\mathbb{R}^p} -\log\left(\frac{\prod_{i=1}^p f_{Y_i}(y_i)}{f_Y(y)}\right) f_Y(y)\, dy_1 \cdots dy_p. \tag{8}$$


It is also called the modified Kullback-Leibler divergence ($KL_m$) between the product
of the marginal densities and the joint density of the vector. Note also that
$MI(Y) =: KL_m\left(\prod_{i=1}^p f_{Y_i}, f_Y\right)$ is nonnegative and achieves its minimum value zero iff
$f_Y(\cdot) = \prod_{i=1}^p f_{Y_i}(\cdot)$, i.e., iff the components of the vector $Y$ are statistically independent. To
clarify the BSS step, we will study separately the case where the
source components are independent and the case where the source components are
dependent.


3.1 A separation procedure for independent sources.


Recall that the relationship between the probability density function and copula density
is given by


$$f_Y(y) = \prod_{i=1}^p f_{Y_i}(y_i)\; c_Y\left(F_{Y_1}(y_1),\dots,F_{Y_p}(y_p)\right). \tag{9}$$

Assume that the source components are independent. Using the relation (9) and
applying the change of variables formula for multiple integrals, we can show
that $MI(Y)$ can be written via copula densities as


$$MI(Y) = \int_{[0,1]^p} -\log\left(\frac{1}{c_Y(u)}\right) c_Y(u)\, du =: KL_m(c_\Pi, c_Y), \tag{10}$$


where $c_Y(u)$ is the copula density of $Y$, and $c_\Pi(u) := \mathbf{1}_{[0,1]^p}(u)$ is the product copula
density. Moreover, $KL_m(c_\Pi, c_Y)$ is nonnegative and achieves its minimum value
zero iff $c_Y(u) = c_\Pi(u)$, $\forall u \in [0,1]^p$, namely, iff the components of the vector $Y$ are
independent.
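As a rough numerical illustration of the copula form of the mutual information, the sketch below estimates $E[\log c_Y(U)]$ for a bivariate sample by rank-transforming each component and using a histogram approximation of the copula density. The histogram estimator, the bin count and the function names are assumptions chosen for brevity; they are a stand-in, not the estimator used by the proposed algorithms.

```python
import numpy as np

def pseudo_obs(y):
    """Rank-transform each row of y (p x N) into (0, 1): empirical version of F_{Y_i}(Y_i)."""
    ranks = np.argsort(np.argsort(y, axis=1), axis=1) + 1
    return ranks / (y.shape[1] + 1.0)

def copula_mi_hist(y, bins=10):
    """Histogram estimate of E[log c_Y(U)] for a 2 x N sample (crude stand-in estimator)."""
    u = pseudo_obs(y)
    counts, _, _ = np.histogram2d(u[0], u[1], bins=bins, range=[[0, 1], [0, 1]])
    p = counts / counts.sum()
    nz = p > 0
    # On each cell the copula density is approximated by p / (cell area) = p * bins**2.
    return float(np.sum(p[nz] * np.log(p[nz] * bins**2)))

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 2000))      # independent uniform sources (illustrative)
A = np.array([[1.0, 0.8], [0.8, 1.0]])          # mixing matrix from the simulation section
mi_sources = copula_mi_hist(s)                  # close to zero for independent components
mi_mixtures = copula_mi_hist(A @ s)             # clearly positive for the mixed signals
```

The estimate for the unmixed sources is small but not exactly zero, because a histogram estimate of this kind is biased upward in finite samples.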

Our approach consists in minimizing, with respect to $B$, the following separation
criterion:


$$KL_m(c_\Pi, c_Y) := E\left[\log c_Y\left(F_{Y_1}(Y_1),\dots,F_{Y_p}(Y_p)\right)\right], \tag{11}$$


where $E(\cdot)$ denotes the mathematical expectation. The function $B \mapsto KL_m(c_\Pi, c_Y)$
is nonnegative and attains its minimum value zero at $B = DPA^{-1}$, where $D$ and $P$
are, respectively, a diagonal and a permutation matrix. In other words, the separation
is achieved at $\hat{B} = \arg\min_B KL_m(c_\Pi, c_Y)$.
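The scale and permutation indeterminacy can be checked numerically. In the sketch below the mixing matrix is the one used in the simulation section, while the diagonal matrix, the permutation and the uniform sources are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
s = rng.uniform(-1.0, 1.0, size=(2, N))    # two illustrative source signals
A = np.array([[1.0, 0.8], [0.8, 1.0]])     # mixing matrix from the simulation section
x = A @ s                                  # observed mixtures, x(n) = A s(n)

D = np.diag([2.0, -0.5])                   # hypothetical scaling matrix
P = np.array([[0.0, 1.0], [1.0, 0.0]])     # permutation swapping the two components
B = D @ P @ np.linalg.inv(A)               # a separating matrix of the form D P A^{-1}
y = B @ x                                  # equals D P s: sources up to order and scale
```

Any such $B$ gives outputs with the same dependence structure as the sources, so a criterion built on copulas cannot distinguish between them; this is why separation is only claimed up to $D$ and $P$.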


3.2 A separation procedure for dependent sources. 


In the case where the source components are dependent, we assume that some
prior information about the copula density of the random source vector $s$ is available.
Note that this is possible for many practical problems: it can be obtained, from realizations
of $s$, by a model selection procedure in semiparametric copula density models
$\{c_\theta(\cdot);\ \theta \in \Theta \subset \mathbb{R}^d\}$, typically indexed by a multivariate parameter $\theta$, see [23]. The
parameter $\theta$ can be estimated using maximum semiparametric likelihood, see [24].
We denote by $\hat\theta$ the obtained value of $\theta$ and by $c_{\hat\theta}(\cdot)$ the copula density modeling the
dependency structure of the source components. Obviously, since the source components
are assumed to be dependent, $c_{\hat\theta}(\cdot)$ is different from the copula density of
independence $c_\Pi(\cdot)$. Hence, we naturally replace $c_\Pi$ by $c_{\hat\theta}$ in (10), and define
the separating criterion


$$KL_m(c_{\hat\theta}, c_Y) := \int_{[0,1]^p} -\log\left(\frac{c_{\hat\theta}(u)}{c_Y(u)}\right) c_Y(u)\, du. \tag{12}$$


Moreover, we can show that the function $B \mapsto KL_m(c_{\hat\theta}, c_Y)$ is nonnegative and
attains its minimum value zero at $B = DPA^{-1}$. The separation for dependent source
components is reached at $\hat{B} = \arg\min_B KL_m(c_{\hat\theta}, c_Y)$.


4 Simulation results 


In this section, we present representative simulation results for the proposed method.
We limit ourselves to the case of 2 mixtures of 2 sources. We start by illustrating
the performance of BSS-copula with a simple experiment on independent sources.
Then we use BSS-copula to separate dependent sources. The results are
compared with the classical mutual information (MI) criterion, see [6], for the
same data. The 2 sources are mixed with the matrix $A := [1\ 0.8;\ 0.8\ 1]$. The gradient
descent parameter is taken to be $\mu = 0.1$, the number of samples is $N = 2000$, and all
simulations are repeated 20 times. The accuracy of source estimation is evaluated
through the signal-to-noise ratio (SNR), defined by


$$SNR_i := 10 \log_{10} \frac{\sum_{n=1}^N s_i(n)^2}{\sum_{n=1}^N \left(y_i(n) - s_i(n)\right)^2}, \quad i = 1, 2. \tag{13}$$
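A minimal sketch of an output SNR computation follows. It assumes the usual definition, the ratio of source energy to estimation-error energy expressed in decibels, and assumes that the estimate has already been matched to its source in order and scale; the function name is hypothetical.

```python
import numpy as np

def snr_db(s, y):
    """Output SNR in dB for one source s and its estimate y (1-D arrays of equal length)."""
    s = np.asarray(s, dtype=float)
    y = np.asarray(y, dtype=float)
    return 10.0 * np.log10(np.sum(s**2) / np.sum((y - s)**2))

# Illustrative check: a 10% amplitude error corresponds to an SNR of 20 dB.
s_demo = np.array([1.0, -2.0, 3.0, 0.5])
snr_demo = snr_db(s_demo, 1.1 * s_demo)
```

Because separation is only defined up to scale and permutation, the outputs need to be reordered and rescaled before this quantity is meaningful.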


4.1 Independent source components: 


In this experiment, we consider two mixed signals for two kinds of sample sources:
uniform i.i.d. sources with independent components (Figure 1), and i.i.d. sources with independent
components drawn from the 4-ASK (Amplitude Shift Keying) alphabet (Figure 2).
We observe from Figure 1 and Figure 2 that the proposed method gives
good results for the standard case of independent component sources.

Figure 3 shows the criterion value versus iterations. We can see that our criterion
converges to 0 when the separation is achieved.

Fig. 3 The criterion value vs iterations: uniform independent sources.


4.2 Dependent source components: 


In this subsection we show the capability of the proposed method for
dependent sources to successfully separate two dependent mixed signals. We deal
with instantaneous mixtures of three kinds of sample sources:


1. i.i.d. (with uniform marginals) vector sources with dependent components generated
from the Ali-Mikhail-Haq (AMH) copula with $\theta = 0.8$.

2. i.i.d. (binary phase-shift keying (BPSK) marginals) vector sources with dependent
components generated from the Farlie-Gumbel-Morgenstern (FGM) copula with
$\theta = 0.85$.

3. i.i.d. (with uniform marginals) vector sources with dependent components generated
from the Clayton copula with $\theta = 2.5$.
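Dependent sources of the third kind can be simulated as in the sketch below, which draws pairs from a bivariate Clayton copula using the Marshall-Olkin (gamma frailty) construction and compares the sample Kendall's tau with the theoretical value $\theta/(\theta+2)$. The sampling method, seed, sample size and function names are assumptions for illustration; the text does not say how the sources were generated.

```python
import numpy as np

def sample_clayton(n, theta, rng):
    """Draw n pairs (U1, U2) from a bivariate Clayton copula (theta > 0), Marshall-Olkin method."""
    v = rng.gamma(shape=1.0 / theta, scale=1.0, size=n)   # gamma frailty variable
    e = rng.exponential(scale=1.0, size=(2, n))           # independent unit exponentials
    return (1.0 + e / v) ** (-1.0 / theta)                # uniform marginals on (0, 1]

def kendall_tau(u, v):
    """Plain O(n^2) Kendall's tau for samples without ties."""
    du = np.sign(u[:, None] - u[None, :])
    dv = np.sign(v[:, None] - v[None, :])
    n = len(u)
    return float(np.sum(du * dv) / (n * (n - 1)))

rng = np.random.default_rng(42)
theta = 2.5                                  # Clayton parameter used in the experiment
u = sample_clayton(1000, theta, rng)         # 2 x 1000 array of dependent uniform sources
tau_hat = kendall_tau(u[0], u[1])
tau_theory = theta / (theta + 2.0)           # Kendall's tau of the Clayton copula
```

Non-uniform marginals, such as the BPSK case, would be obtained by applying the corresponding quantile function to each row of `u`.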


Figures 4-6 show the SNRs for each kind of sample source. It
can be seen from the simulations that the proposed method is able to separate, with
good performance, the mixtures of dependent source components.

Moreover, Figure 7 shows the criterion value versus iterations for the AMH copula.
We can see that our criterion converges to 0 when the separation is achieved.

Fig. 4 Average output SNRs versus iteration number: uniform dependent sources from
AMH-copula.

Fig. 5 Average output SNRs versus iteration number: BPSK dependent sources from
FGM-copula.

Fig. 6 Average output SNRs versus iteration number: uniform dependent sources from
Clayton-copula.

Fig. 7 The criterion value vs iterations: uniform dependent sources from AMH-copula.


4.3 Comparison 


In this section, both independent and dependent signal sources are tested to confirm
the performance of our proposed method, which is compared with the MI method proposed
in [6] for instantaneous linear mixtures, under the same conditions. The top
panels of Figures 8-10 show the means of the SNRs of the two sources for each
kind of sample source. It can be seen from the simulations in Figure 8 (the standard
case of independent component sources) that the proposed method achieves
separation with accuracy similar to that of [8]. In the case of dependent
component sources, one can see from the simulations in Figures 9-10 that
our method exhibits better performance than the MI one. The bottom panels of Figures
9-10 show the criterion value versus iterations. As we can see, the criteria
of both methods converge to zero when the separation is achieved. However, the
proposed method gives two well separated sources, whereas the MI method provides two
independent signals that are far from the true sources. This is clearly seen in the top panels of
Figures 9-10, which show the means of the SNRs of the two sources for each
kind of sample source.

Fig. 10 Average output SNRs versus iteration number: uniform dependent sources from Clayton-
copula.


5 Conclusions 


We have presented a new BSS algorithm. The approach is able to separate instantaneous
linear mixtures of both independent and dependent source components. In
Section 4, the accuracy and the consistency of the obtained algorithms are illustrated
by simulation for the 2 x 2 mixture-source case. It should be mentioned that our proposed
algorithms, based on copula densities rather than on probability densities as in the classical ones,
are more time consuming, since we estimate both the copula density
of the vector and the marginal distribution function of each component. The present
approach can be extended to deal with convolutive mixtures; this will be addressed
in future communications.


An innovative approach for Opinion Mining:
the Plutchick analysis 


Un approccio innovativo per Opinion Mining: la 
Plutchick analysis 


Massimiliano Giacalone, Antonio Ruoto, Davide Liga, Maria Pilato, Vito 
Santarcangelo 


Abstract In this work we introduce an innovative approach to “Sentiment Analysis”
or Opinion Mining, which is classically based on the concept that some words
have positive or negative meanings. In fact, by introducing the Plutchik score, it is
possible to achieve an Emotional Analysis, that is, a deeper analysis that goes beyond polarity.
The original contribution of the paper is to present a program for Italian emotional
analysis of social network hashtags, mainly as part of the “InfoSphere”. For this purpose
we introduce AIN_EMOTION, an evolution of the AIN Thesaurus, which is the first Italian
thesaurus for Emotional Analysis. This analysis gives the proportion of emotional hashtags
shared by social network users, can reveal a behavioral trend and could be
applied to any other language simply by changing the “emotional thesaurus”.
Abstract In questo lavoro si introduce un approccio innovativo per la “Sentiment 
Analysis” o Opinion Mining, che si basa sul concetto classico che alcuni parole 
assumono significati positivi o negativi. In particolare, introducendo il punteggio 
Plutchick, e’ possibile realizzare un’analisi emotiva, cioe’ un’analisi piu’ appro- 
fondita sulla polarita’. Il contributo originale del lavoro e’ quello di presentare un 
programma su Emotional in italiano utilizzando un analisi delle reti sociali hash- 
tag principalmente come parte di InfoSphere. Per questo scopo abbiamo introdotto 
AIN EMOTION, un’evoluzione del AIN Thesaurus, che e’ il primo thesaurus ital- 
iano per l’analisi emozionale. Questa analisi da’ un rapporto di hashtag emotiva 
condiviso da utenti del social network, in grado di produrre una tendenza compor- 
tamentale e potrebbe essere esteso ed applicato a qualsiasi altra lingua semplice- 
mente cambiando il “thesaurus emozionale”. 


Key words: Sentiment analysis, Instagram, Social Network, Sentiment Thesaurus, 
Emotion Projection


1 Introduction


Social networks push people to emphasize their own identity: people want to be
part of certain kinds of groups and, naturally, the wish to be part of some groups
can be seen as the wish not to be part of many other groups. This means
that social networks generate a world in which groups grow
increasingly separated, if not openly opposed, a world in which identities
are clustered day by day [8].

This sort of clustered Network Society raises some important concerns. Firstly,
the more a society presents clustered identities, the greater the risk of social conflicts
becomes. Secondly, it seems that public opinion derived from social networks generates
debates that are more emotional than reasonable. As a result, debates of this
kind could be a major step towards a further consolidation of Post-Truth politics.
If the Web is the place where everybody can express his/her own opinion, and
where groups develop while becoming increasingly isolated or separated, social conflicts
could arise [9].

The reason for such a concern could be the lack of communication between different
points of view: groups are closed and self-referential, so there is less space for a true
and sensible debate. This is a huge issue because the Web is likely to be the place
where public opinion will be generated in the future. Moreover, the more groups
become isolated and self-referential, the more emotional debates become. In this regard,
it should also be noted that information overload can exert a significant
influence on the choices of individuals and groups. It seems that this kind of information
reduces the awareness of ideas, pushing people to rely on groups and collective
identities or even to switch from one group to another. If public opinion becomes
emotional and volatile, political powers will try to adjust to this lack of
awareness. For example, that could be one of the reasons why western societies are
facing so-called populisms [10].

As can be seen, there is good reason to believe that it is vital to develop an in-depth
understanding of how human emotions and the Network Society are related. As
said, public opinion is becoming more emotional and volatile; in addition,
societies are more clustered and potentially conflictual, and all these aspects seem
to be politically relevant. In this scenario, social networks are likely to become increasingly
important in the generation of the common sentiment: they are the place
where individual choices can be influenced and public opinion can be interpreted
[11].


2 The social network in the infosphere semantic space 


The following research aims to outline a survey methodology that can map the emotional
level of the great Italian conversation on the network as part of the broader
context of the infosphere. Specifically, an application of this methodology
to the social network “Instagram” will be given, with the aim of understanding, with
regard to Italian-language users: 1) in what direction the generated
emotional strength propagates; 2) what the main conveyed emotional projections are; 3) what
the 10 most significant conveyed emotional biases are. In the philosophy of information,
the term infosphere means the totality of the information space. The infosphere
is “the semantic space constituted by the totality of the documents, the agents and
their operations”, where “documents” means any data, information and knowledge,
codified and implemented in any semiotic format, “agents” are any system able to
interact independently with a document (such as a person, an organization or web software
robots) and the term “operations” includes any kind of action, interaction and
transformation that can be done by an agent and which can be presented in a document.
It is an environment in which the organisms are formed as interconnected
cells [1].

Fig. 1 The 10 main polarizations

It is immediately evident, from the perspective of the contemporary Network Society,
that the current dynamics of the great conversation on the network constitute an
essentially structural part of the infosphere. The spread of the Internet, mobile
communications and digital media, along with a wide range of social software tools, is
driving the development of interactive and horizontal communication networks that
connect, at any time, the local and the global. The communication system of industrial
society revolved around the mass media, characterized by the mass distribution of a
one-way, one-to-many message. The communication foundation of the
network society is the global system of horizontal communication networks, which
includes the multimodal exchange of interactive many-to-many messages,
both synchronous and asynchronous. If we consider the public sphere as the
space in which public opinion forms, the analysis of the dynamics at play within
the great conversation on the network, affected by events and agenda-setting
issues, can be useful for reading the major changes taking place and for producing predictive
projections on the generation of common sense and on the mechanisms by which the collective
and individual imaginary is constructed [2].

However, it must be noted that the public opinion structured along the great network
conversation is an essentially emotional opinion. Information overload and the
effects of emotional sharing are transforming public debate into a more
emotional than rational debate [3, 4].


3 Methodology 


In the Network Society, then, the only effective message is an emotional message,
one that starts from something emotionally powerful. The amplification of the emotional
sphere becomes the basis for the circulation of information in the sphere
of mass self-communication, a movement that runs along the paths laid out by the
emotional contagion mechanism. This confirms the claim made almost three
centuries ago by the philosopher David Hume: reason is the slave of the emotions,
and not vice versa [5].

If one accepts that, in this context, the decisions of social movements
rest on processes that involve the transformation of social emotions,
an emotional intelligence operation makes it possible to frame and define the extent of the
emotional dynamics in place that are able to contribute to the processes creating common
sense and to influence the construction of the collective and individual imaginary [7].

The mapping of the emotions conveyed in the great network conversation is carried
out in this work through the OSINT methodology, that is, by
gathering information from publicly available sources. Specifically, the
main adjectives of the Italian language, as used by Instagram users in hashtags
produced from 10/10/2010 (the year of the platform's launch in Italy) to 10/01/2016
with a frequency greater than or equal to 100,000, are classified on the basis of an
emotional score. The emotional score attributed to each hashtag is based on the AIN thesaurus,
among the most comprehensive thesauri for sentiment analysis of the Italian language,
which assigns to the main adjectives of the Italian language a positive or
negative emotional polarity following a scoring system based on the following scale:
+2 (very positive), +1.5, +1, +0.5;

0 (neutral);

-0.5, -1, -1.5, -2 (very negative).

In addition to determining the emotional polarity of each hashtag analyzed, the hashtags have
been classified through the emotional scale proposed by the American psychologist
Robert Plutchik [6].

AIN_EMOTION is the first thesaurus of “emotional” type for the Italian language.
The AIN Thesaurus, the thesaurus for opinion mining (positive,
negative, neutral) in the Italian language, has been enriched with the field “EMOTION”,
categorizing each adjective according to the Plutchik emotional scale. The following is an
example, where each adjective is associated with one or more Plutchik clusters.
For each word (abandoned, lowered, down, propertied, combined, plentiful, tanned,
abominable, endearing) there is a corresponding cluster (pain, sadness, acceptance,
sadness, serenity, interest, admiration, disgust).
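The combination of a polarity score and an emotional cluster for each adjective can be sketched as a simple lookup. The entries, scores and the count-weighted averaging below are invented for illustration and are not taken from AIN_EMOTION; the function name and the example hashtags are likewise hypothetical.

```python
from collections import Counter

# Hypothetical mini-thesaurus: adjective -> (polarity score, Plutchik cluster).
# Entries and scores are invented for illustration; they are not AIN_EMOTION data.
THESAURUS = {
    "felice": (2.0, "joy"),
    "sereno": (1.0, "serenity"),
    "triste": (-1.5, "sadness"),
    "disgustoso": (-2.0, "disgust"),
}

def score_hashtags(hashtag_counts):
    """Return the count-weighted mean polarity and the emotion-cluster totals."""
    total, weighted, clusters = 0, 0.0, Counter()
    for tag, count in hashtag_counts.items():
        entry = THESAURUS.get(tag.lstrip("#").lower())
        if entry is None:
            continue                      # adjective not in the thesaurus: skipped
        polarity, cluster = entry
        total += count
        weighted += polarity * count
        clusters[cluster] += count
    mean_polarity = weighted / total if total else 0.0
    return mean_polarity, clusters

mean_polarity, clusters = score_hashtags({"#felice": 300, "#triste": 100, "#boh": 50})
```

Replacing the dictionary with a thesaurus for another language would leave the rest of the procedure unchanged.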

Thanks to the information from the emotional scale, it is possible to perform higher-level
opinion mining, not limited to polarity alone but adding semantic information of particular
importance, in order to carry out an investigation more focused on the emotional
state of the author. In fact, this approach introduces an “emotional” analysis, thanks
to the granularity of the Plutchik score. Traditional sentiment analysis
considers only 3 classes (negative, neutral, positive), with a possible weight for
a finer analysis. In the Plutchik approach we have 32 possible classes: 8 emotions with 3
different degrees each (e.g. admiration, trust, acceptance) and 8 intermediate ones (love,
submission, awe, disapproval, remorse, contempt, aggressiveness, optimism), as shown
in Fig. 2.

Fig. 2 Sentiment Analysis on Instagram and Plutchik Wheel of Emotions
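The count of 32 classes can be reproduced by listing the wheel explicitly. The sketch below uses the English labels commonly given for Plutchik's wheel of emotions; the variable names are hypothetical, and the labels may differ from the translated cluster names used in the thesaurus.

```python
# Eight primary emotions, each listed as (mild, basic, intense) degrees.
PRIMARY = {
    "joy": ("serenity", "joy", "ecstasy"),
    "trust": ("acceptance", "trust", "admiration"),
    "fear": ("apprehension", "fear", "terror"),
    "surprise": ("distraction", "surprise", "amazement"),
    "sadness": ("pensiveness", "sadness", "grief"),
    "disgust": ("boredom", "disgust", "loathing"),
    "anger": ("annoyance", "anger", "rage"),
    "anticipation": ("interest", "anticipation", "vigilance"),
}

# Eight intermediate emotions, each lying between two adjacent primary emotions.
DYADS = ("love", "submission", "awe", "disapproval",
         "remorse", "contempt", "aggressiveness", "optimism")

classes = [name for degrees in PRIMARY.values() for name in degrees] + list(DYADS)
n_classes = len(classes)   # 8 * 3 + 8
```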


4 Discussion of results and conclusions 


From the surveys conducted it is possible to state, in summary, that, with regard to
Italian-language users, the social network Instagram: 1) conveys emotional biases that are
positive on average: among the analyzed hashtags, those used most often by users have a
rather positive emotional polarity, while very few hashtags
have a negative average emotional polarity; 2) constitutes an environment in
which the most represented emotional classes are those relating to the “expectation”
sphere and to “pain”. The emotions that belong to the emotional classes of “anger”
and “disgust” turn out to be almost nonexistent.

Such results should be read in light of an observation: the form of the content
and the architecture of social environments are more important than the content itself;
we are moving from communication processes to narratological environments. And the
stories conveyed by Instagram users, because of the nature of the medium, are on
average functional to one goal: the digital packaging of the self. In this sense the
self-production of content converges towards self-marketing with the aim,
more or less conscious and declared, of appearing on the market of relations in as positive
a light as possible. This trend seems to respond to a need for performance
in a society where it seems that one has to be seen in order to exist. No wonder then that
the emotions conveyed through the analyzed hashtags are flattened onto emotional biases
of an almost exclusively positive type. It is a very precise seduction strategy of a negotiating
type: convey positive emotions to receive positive ones in return. In this sense,
any genuinely critical engagement with reality seems to be banned on Instagram.
This evidence is even more overwhelming if one takes into account
a relational dynamic shaped by a homophilous logic. In this sense the emotional
dynamics conveyed through Instagram ultimately attribute to likeability, the principle
of “pleasure”, the only true god to be served in the development of a strategy
of attention conveyed through social narratives, much of the power to create
common sense.

Fig. 3 Emotions extracted by M. Di Lecce tool


A G.E.D. method for market risk evaluation
using a modified Gaussian Copula 


Un metodo G.E.D. per la valutazione del rischio di 
mercato usando una Copula Gaussiana modificata 


Massimiliano Giacalone and Demetrio Panarello 


Abstract In this paper, we show some results regarding the evaluation of Value-at- 
Risk (VaR) of some portfolios using a Gaussian Copula, modified by introducing the 
Generalized Correlation Coefficient, and assuming a Generalized Error Distribution 
(G.E.D.) for the single returns in the portfolios. In the literature, various authors 
considered the Copula function approach to evaluate market risk. In our proposal 
we consider an Lpmin algorithm to estimate p, the shape parameter of the distribution.
Finally, we compare the classical RiskMetrics method with our G.E.D. method
based on a modified Gaussian Copula. 

Error Distribution, Generalized Correlation Coefficient.
1 Introduction


One of the most important issues in finance is to correctly measure the riskiness of 
a portfolio, which is fundamental to preserve its value over time. Since asset returns 
are usually fat-tailed, the use of Gaussian processes leads to an underestimation of 
the risk (Rachev et al., 2005). 

Value-at-Risk (VaR) is used to quantify the risk of loss of an asset or a portfolio.
The most straightforward method to calculate the (1-c)% Value-at-Risk is the
RiskMetrics one (Longerstaey et al., 1996), which assumes that the returns $R_i$,
with i = 1, 2, ..., N, of the N assets of a portfolio are jointly distributed according to a
multivariate Gaussian (Kasch & Caporin, 2013). However, this hypothesis is a
simplification, since it only considers the first two moments, neglecting the fact that the
variations in asset returns usually show leptokurtic and asymmetric behavior
(Caporin, 2003).

To measure risk more accurately, one proposal (e.g. Malevergne & Sornette,
2003) is to model the interdependence of the asset returns in a portfolio by means
of Copula functions. The problem then is to identify the marginal distributions that
best model the returns of the single assets, and to define the Copula most
suitable to represent the returns' interdependence structure.


2 The Generalized Error Distribution and the Gaussian Copula 


The Generalized Error Distribution (G.E.D.) family was introduced by Subbotin 
(1923) and has been employed by various authors with different names and param- 
eterizations. In our paper, we will use the Vianelli (1963) parameterization, which 
is: 


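$$f(x; \mu, \sigma_p, p) = \frac{1}{2\sigma_p\, p^{1/p}\, \Gamma(1 + 1/p)} \exp\left(-\frac{1}{p}\,\frac{|x - \mu|^p}{\sigma_p^p}\right) \quad \text{for } -\infty < x < \infty \qquad (1)$$

where $\mu$ is the location parameter, $\sigma_p = [E|x - \mu|^p]^{1/p}$ is the scale parameter and
$p \ge 1$ is the shape parameter.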
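
As a quick check on equation (1), the density can be coded directly. The following is a minimal sketch in Python (the paper's own computations were carried out in R); the function names `ged_pdf` and `normal_pdf` are illustrative and not taken from the paper. With p = 2 the density should coincide with the Gaussian one, and with p = 1 it reduces to the Laplace density.

```python
import math


def ged_pdf(x, mu=0.0, sigma_p=1.0, p=2.0):
    """G.E.D. density in the Vianelli parameterization of equation (1)."""
    norm_const = 2.0 * sigma_p * p ** (1.0 / p) * math.gamma(1.0 + 1.0 / p)
    return math.exp(-abs(x - mu) ** p / (p * sigma_p ** p)) / norm_const


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density, used only to check the special case p = 2."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    # p = 2 reproduces the Gaussian density; smaller p gives fatter tails.
    print(ged_pdf(3.0, p=2.0), normal_pdf(3.0))
    print(ged_pdf(3.0, p=1.4))
```

Values of p below 2 put more mass in the tails than the Gaussian case, which is the property exploited later in the application.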

A Copula is a simple function that associates univariate marginal distributions to 
their joint ones (Jaworski, 2010). There are various Copula functions in the liter- 
ature (McNeil et al., 2015) and others can be introduced, in order to capture the 
different dependence structures among stochastic variables. 

In the bivariate case, the Gaussian Copula is: 


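$$C(u, v \mid \rho) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)} \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left\{-\frac{r^2 - 2\rho r s + s^2}{2(1 - \rho^2)}\right\} dr\, ds$$

where $\Phi^{-1}$ is the inverse of the standard Gaussian distribution function and $\rho$ is Pearson's
correlation coefficient.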
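
In practice the bivariate Gaussian Copula is rarely evaluated through the double integral; it is usually sampled. The sketch below, in Python with only the standard library, draws correlated standard normal pairs and maps them through the Gaussian distribution function. The function name and the seed argument are illustrative assumptions, not part of the paper.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(rho, n, seed=0):
    """Draw n pairs (u, v) from the bivariate Gaussian Copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 has unit variance and correlation rho with z1
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs


if __name__ == "__main__":
    for u, v in sample_gaussian_copula(0.8, 5):
        print(round(u, 3), round(v, 3))
```

Each coordinate is uniform on (0, 1), while the dependence between the two coordinates is governed entirely by the correlation parameter.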
The Gaussian Copula considered here is denoted by $C(\rho_p)$, since the $\rho$ parameter
is replaced by the Generalized Correlation Coefficient $\rho_p$, introduced by Taguchi
(1974) as the correlation parameter of a bivariate Generalized Error Distribution,
and defined as (Agro & Martorana, 2002):


no codispl?) (X.Y) 
Pp = a) 


with $-1 \le \rho_p \le 1$
where 
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sign(Y — uy )]|, 
$\sigma_p(Y) = [E|Y - \mu_Y|^p]^{1/p}$,


$\mu_X$ and $\mu_Y$ power means of order p.


3 The algorithm 


The joint density required for calculating portfolio’s Value-at-Risk is obtained as 
(Agro, 2008): 


$$1 - c = \iint_{s + t \le \mathrm{VaR}} f(s, t)\, ds\, dt$$


The parameters $\mu$, p and $\sigma$ are estimated using the Lpmin method (Giacalone, 1996;
Giacalone & Richiusa, 2006).

The $\rho_p$ parameter is estimated using the Exponentially Weighted Moving Average
recursive formula:


cod isp”), (x,y) 
The fundamental steps in the algorithm are:


• estimation of the parameters $\mu_i$, $p_i$, $\sigma_i$ for the two series of returns;
• estimation of the $\rho_p$ parameter, with $p = \sum_{i=1}^{2} p_i / 2$;
• generation of (x, y) pairs, which are realizations of the double stochastic
  variable (X, Y) having G.E.D. marginals and a dependence relation expressed by
  $C(\rho_p)$;
• calculation of the distribution function of the returns in the portfolio;
• identification of the Value-at-Risk of the return distribution.
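
The last three steps can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the parameters are taken as already estimated (the Lpmin and EWMA estimation steps are not reproduced), the two assets are given equal weights, the G.E.D. quantile is obtained by bisection on a series expansion of the incomplete gamma function, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    if x > 40.0 + a:  # upper tail is below double precision when a <= 1
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))


def ged_cdf(x, mu, sigma_p, p):
    """Distribution function of the G.E.D. (Vianelli parameterization, p >= 1)."""
    z = abs(x - mu) ** p / (p * sigma_p ** p)
    half = 0.5 * reg_lower_gamma(1.0 / p, z)
    return 0.5 + half if x >= mu else 0.5 - half


def ged_ppf(u, mu, sigma_p, p):
    """Quantile function of the G.E.D., obtained by bisection on ged_cdf."""
    lo, hi = mu - 60.0 * sigma_p, mu + 60.0 * sigma_p
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ged_cdf(mid, mu, sigma_p, p) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_var(mu, sigma, p_pair, rho_p, level, n_sim=2000, seed=1):
    """Simulated return quantile at `level` for an equally weighted two-asset portfolio."""
    p = sum(p_pair) / 2.0  # common shape parameter: average of the two estimates
    rng = random.Random(seed)
    phi = NormalDist()
    returns = []
    for _ in range(n_sim):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho_p * z1 + math.sqrt(1.0 - rho_p ** 2) * rng.gauss(0.0, 1.0)
        x = ged_ppf(phi.cdf(z1), mu[0], sigma[0], p)  # G.E.D. marginal, asset 1
        y = ged_ppf(phi.cdf(z2), mu[1], sigma[1], p)  # G.E.D. marginal, asset 2
        returns.append(0.5 * x + 0.5 * y)  # equal weights are an assumption
    returns.sort()
    return returns[int(level * n_sim)]


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    print(simulate_var((0.0, 0.0), (0.01, 0.012), (1.3, 1.5), 0.4, 0.025))
```

The value returned is the empirical quantile of the simulated portfolio returns, so a loss appears as a negative number. A real application would replace the hypothetical parameters with the estimated ones and use the actual portfolio weights.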


4 Application and results 


In order to evaluate and compare the performance of the two VaR methods
considered, two portfolios were constructed, as described below:


1. a Bond-ETF Portfolio, made up of a BTP-1FB37 4% Italian Bond, a BTP-1MZ21
3.75% Italian Bond and a LYXOR Exchange-Traded Fund. The data used are
the daily prices for the years 2012-2016, for a total of 1267 observations (data source:
Teleborsa.it);

2. an Exchange indices Portfolio, made up of three exchange-rate indices:
the Euro-US Dollar (EUR-USD), the Pound Sterling-US Dollar (GBP-USD) and
the Swiss Franc-Yen (CHF-JPY). The data used in this case are the daily
quotations for 2012-2016, for a total of 1305 observations (data source: Investing.com).


Each time series of daily prices $p_i$ was transformed into a series of logarithmic
returns according to the relationship:


$$R_i = \log(p_{i+h}) - \log(p_i)$$


where h, the time interval of interest, is set to one day.
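
The transformation is a one-liner in most languages. A minimal Python sketch follows; the function name `log_returns` and the sample prices are illustrative.

```python
import math


def log_returns(prices, h=1):
    """Logarithmic returns R_i = log(p_{i+h}) - log(p_i) for a price series."""
    return [math.log(prices[i + h]) - math.log(prices[i])
            for i in range(len(prices) - h)]


if __name__ == "__main__":
    prices = [100.0, 101.5, 100.8, 102.3]  # hypothetical daily prices
    print(log_returns(prices))
```

Log returns are additive over time, so the h-day return equals the sum of the h one-day returns it spans.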
The estimates of the shape parameter were $\hat{p} < 2$; hence the distributions of the
returns are leptokurtic and more fat-tailed than the Gaussian one. Figure 1 shows
the returns of the two portfolios, with fitted Gaussian and G.E.D. distributions.
The reliability of the methods for calculating Value-at-Risk is evaluated by means
of a backtest, in which the (1-c)% VaR predictions are compared with the
profits and losses actually recorded in the market.

The RiskMetrics method (VaR-R.M.) and the G.E.D. method, here called VaR- 
G.E.D., were applied to the two portfolios. 
The backtest applied to the Bond-ETF and Exchange indices portfolios to predict
2.5% VaR and 0.5% VaR highlighted the better predictive capacity of the method
which accounts for the leptokurtosis of the marginal distributions. The G.E.D. VaR gives
predictions which are closer to the real losses, including the extreme losses
present in the time series of returns. Moreover, an extreme event, i.e. an
exceptional loss or profit, is not absorbed quickly but influences subsequent
predictions, which are hence conservative (they overestimate the risk).

Fig. 1 Bond-ETF (left) and Exchange indices (right) Portfolios, $\hat{p} = 1.4$

The number of VaR violations can be seen as a binomial stochastic variable in which 
the probability of success p is the percentage of VaR violations predicted (for ex- 
ample 5%) and the number of trials m is the number of days used for the backtest. 
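
The binomial reasoning can be turned into an acceptance range for the number of violations. The Python sketch below assumes a normal approximation to the binomial with rounding to whole violations; the text does not state how the intervals were actually computed, so this is an assumption, and the function name is illustrative. With m = 200 it gives 1-9 violations at the 2.5% level and 0-3 at the 0.5% level.

```python
import math
from statistics import NormalDist


def violation_interval(m, p, confidence=0.95):
    """Approximate range for the number of VaR violations in m backtest days.

    Uses the normal approximation to a Binomial(m, p) count (an assumption).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    mean = m * p
    sd = math.sqrt(m * p * (1.0 - p))
    lower = max(0, round(mean - z * sd))
    upper = round(mean + z * sd)
    return lower, upper


if __name__ == "__main__":
    print(violation_interval(200, 0.025))
    print(violation_interval(200, 0.005))
```

A method whose violation count falls outside the range is rejected at the chosen confidence level; exact binomial quantiles could be used instead and may differ by one unit at the upper end.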


                        Bond-ETF Port.      Exchange indices Port.
                        2.5%      0.5%      2.5%      0.5%

VaR-R.M.                6         4         5         3
VaR-G.E.D.              5         2         3         2
Confidence intervals    1-9       0-3       1-9       0-3


Table 1 VaR violations and 95% confidence intervals


Tab. 1 gives the number of VaR violations recorded for the two methods and the
corresponding 95% confidence intervals for m = 200 observations. It shows
that, for both portfolios, the VaR violations with the VaR-G.E.D. method are fewer
than those with the VaR-R.M. method and lie within the confidence intervals.


5 Conclusions 


In our application, we made a comparison between the Gaussian and the G.E.D. 
Copula. The reason is that the variation of the p shape parameter allows the G.E.D. 
to represent all the symmetric distributions that are described in the literature. That 
is, after estimating p, we are able to use the Copula which best fits our data: all the 
Copula functions can be obtained as particular cases of the G.E.D. Copula.

Among the different methods proposed in the literature for calculating Value-at- 
Risk, we took into account the well-known RiskMetrics method. We proposed a 
G.E.D. method and evaluated its performance compared to the RiskMetrics one. 
The two methods were evaluated by backtest, in order to examine the ability of pre- 
dicting the potential loss of a portfolio. 

The results obtained confirm the better performance of the G.E.D. method, while
the assumption of normality of the returns' distribution yields confidence
intervals with the lowest predictive power. The assumption of normality, once
tested, was rejected, as the returns of all the assets examined show kurtosis
characteristics which are neglected by the RiskMetrics method. The VaR-G.E.D.
method thus appears to be a valid generalization of the VaR-R.M. method: it is
close to it in the case of Gaussian marginal distributions, and moves away from
it when the distributions are more fat-tailed.

All the necessary calculations were implemented and run in the R statistical
environment.
Labour market dynamics and recent economic
changes: the case of Italy




Chiara Gigliarano and Francesco Maria Chelli 


Abstract The aim of the paper is to analyse the differences in survival times of job
contracts among subgroups of workers, based on different socio-demographic
characteristics such as age, gender, educational level and geographical area. In
particular, we apply the well-known Gini index to the measurement of concentration in
survival times within groups of workers, and as a way to compare the distribution of
survival times across such groups. We consider a test for differences in the
heterogeneity of survival distributions, which may suggest the presence of a
differential covariates effect on the job contract survival. The analysis is based on
the Italian Compulsory Communications system data, which record all the
activations, transformations, fixed-term extensions and anticipated terminations of
employment relationships between any worker and employer in Italy.



Key words: Labour market, Survival analysis, Gini index, Compulsory 
Communications system data 
1 Introduction


The Gini index is one of the most important statistical indices employed in social 
sciences for measuring concentration in the distribution of a positive random 
variable; it is mainly used in economics as a measure of income or wealth inequality 
among individuals or households (see, e.g., Gini 1912, 1914). Recently, the Gini 
coefficient has been used to describe concentration in levels of mortality, or in length 
of life, among different socio-economic groups, and to evaluate inequality in health 
and in life expectancy (see, e.g., Hanada 1983; Bonetti et al. 2009). 

The aim of this paper is to analyse the differences in survival times of job contracts
among subgroups of workers, from the point of view of concentration. We examine 
the differences both in the length of the first job contract and in the waiting time 
between the end of the first contract and the beginning of a new one. We apply the 
well-known Gini index to measure concentration in survival times within groups of 
workers, and as a way to compare the distribution of survival times across such 
groups. We consider a test for differences in the heterogeneity of survival 
distributions, which may suggest the presence of a differential covariates effect on 
the job contract survival. The analysis is based on the Italian Compulsory 
Communications system data, which record all the activations, transformations, 
fixed-term extensions and anticipated terminations of employment relationships 
between any worker and employer in Italy from January 2009 to June 2012. The
target population consists of young workers aged 18 to 35.

The rest of the paper is structured as follows: in Section 2 we briefly review the 
Gini test for survival data; in Section 3 we analyse the Italian labour market from the 
point of view of concentration; in Section 4 we conclude. 


2 The Gini index for survival data: a brief review 


The Gini index measures concentration in the distribution of a positive random 
variable. Bonetti et al. (2009) propose to apply the Gini index in survival analysis in 
order to measure concentration in survival times within groups of subjects. In 
particular, they apply a restricted version of the Gini index to right-censored survival 
data in order to detect differences in concentration (heterogeneity) between the 
survival time distributions of two groups. 

A number of nonparametric statistical tests exist in the literature to test the 
difference in survival distribution functions between groups. Common tests are in the 
class of weighted linear rank tests, including the log-rank test (LR test), the 
Wilcoxon test (W test), the Gray and Tsiatis test (GT test); see, e.g., Harrington and 
Fleming 1982; Gray and Tsiatis 1989. Testing for differences between survival 
distributions via a concentration measure may prove more powerful than these 
methods, for example when one is far from the proportional hazard structure. 
The Gini coefficient of concentration for a positive random variable X with
cumulative distribution function F and survival function S is defined as


$$G = 1 - \frac{\int_0^\infty [1 - F(x)]^2\, dx}{\int_0^\infty [1 - F(x)]\, dx} = 1 - \frac{\int_0^\infty S(x)^2\, dx}{\int_0^\infty S(x)\, dx},$$

see Hanada, 1983. In survival analysis subjects usually have a finite follow-up time,
so we consider the restricted version of the Gini index:


$$G_t = 1 - \frac{\int_0^t S(x)^2\, dx}{\int_0^t S(x)\, dx},$$


where t represents the longest follow-up time in the data.


The minimum value of G is reached when all subjects have the same survival time,
while the maximum value is obtained when one individual has the maximum survival
time and the rest of the population experiences the event immediately.
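
To make the restricted index concrete, the sketch below computes it in Python from right-censored data by plugging a Kaplan-Meier step function into the ratio of integrals. This is a minimal illustration, not the estimator code used by the authors; the function names and the toy data are assumptions.

```python
def km_steps(times, events):
    """Kaplan-Meier estimate as a list of (time, survival just after time)."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]  # event indicator: 1 = event, 0 = censored
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            steps.append((t, surv))
        at_risk -= leaving
    return steps


def restricted_gini(times, events, tau):
    """Restricted Gini index: 1 - int_0^tau S^2 dx / int_0^tau S dx."""
    int_s = 0.0
    int_s2 = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in km_steps(times, events):
        if t >= tau:
            break
        int_s += prev_s * (t - prev_t)
        int_s2 += prev_s ** 2 * (t - prev_t)
        prev_t, prev_s = t, s
    # last piece of the step function, up to the restriction time tau
    int_s += prev_s * (tau - prev_t)
    int_s2 += prev_s ** 2 * (tau - prev_t)
    return 1.0 - int_s2 / int_s


if __name__ == "__main__":
    # Hypothetical contract durations in months; 0 marks a censored contract.
    print(restricted_gini([1, 2, 3], [1, 0, 1], tau=3))
```

When every subject has the same survival time the survival function equals one up to that time, the two integrals coincide and the index is zero, in line with the minimum described above.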

Bonetti et al. (2009) and Gigliarano and Bonetti (2013) propose a test based on
the restricted Gini index $G_t$ for comparing the survival functions of two
different groups. Their Gini test is designed to detect differences between two survival
distributions from the point of view of concentration. The Gini test statistic is


$$T = \frac{(\hat{G}_{1,t} - \hat{G}_{2,t})^2}{\widehat{\mathrm{Var}}(\hat{G}_{1,t}) + \widehat{\mathrm{Var}}(\hat{G}_{2,t})},$$


where $\hat{G}_{j,t}$ is the estimator of the restricted Gini index for censored data in
group j and $\widehat{\mathrm{Var}}(\hat{G}_{j,t})$ is the estimator of the approximate variance of $\hat{G}_{j,t}$, for
j = 1, 2.

Bonetti et al. (2009) prove that under the null hypothesis of equality of the two 
survival distributions, the statistic T has an approximate chi-squared distribution 
with 1 degree of freedom, while, under any alternative to the null hypothesis, T is 
distributed as an approximate noncentral chi-squared distribution. 
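
Given the two estimated indices and their variance estimates, the test reduces to a chi-squared tail probability with 1 degree of freedom, which can be written with the complementary error function. The Python sketch below assumes the statistic is the squared difference of the two estimates divided by the sum of the variance estimates; the input numbers are hypothetical and the variance estimators themselves are not reproduced.

```python
import math


def gini_test(g1, var1, g2, var2):
    """Gini test statistic and its p-value under a chi-squared(1) null distribution."""
    t_stat = (g1 - g2) ** 2 / (var1 + var2)
    # For one degree of freedom, P(chi2 > t) = P(|Z| > sqrt(t)) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_stat / 2.0))
    return t_stat, p_value


if __name__ == "__main__":
    # Hypothetical restricted Gini estimates and variances for two groups.
    print(gini_test(0.30, 0.0004, 0.25, 0.0005))
```

A small p-value indicates that the two groups differ in the concentration of their survival times.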


3 Data description 
The empirical illustration is based on a sample of the Compulsory
Communications ("Comunicazioni Obbligatorie") data provided by the Italian Ministry
of Labour and Social Policies.¹

The Compulsory Communications (henceforth, CC) data include all activations,
transformations, fixed-term extensions and early terminations of a working
relationship, either public or private.

The sample refers to all Italian workers born on 15 January, 15 April, 15 July and 
15 October of any year. Our database therefore includes about 1 out of 91 of all 
workers who have been involved in the CC system over the period between January 
2009 and June 2012. 

The population of interest consists of workers aged 18-35 who activated a contract in
2009. Individuals who entered the CC database for the first time after December 31,
2009 are excluded from the analysis.

The CC data have as unit of observation the contract ("contratto di lavoro"), 
defined as a working relationship between an employer and an employee and 
characterized by a starting date. However, in the context of mobility analysis, the key 
concept is the worker rather than the contract; therefore, the worker’s history needs 
to be reconstructed starting from the original CC data, so that the observation unit 
becomes the individual. 

For more details on the data preparation and cleaning process we refer to Lilla 
and Staffolani (2011), while further information on the methodology for joining 
different contracts corresponding to same individual can be found in Picchio and 
Staffolani (2013). 

CC data provide information on the daily occupational status of an individual.
Here, for simplicity, a monthly unit of time is considered, and for each month the
prevalent contract is selected (according to type and length of contract).

The variable of interest is the occupational status. Four types of
occupational status are considered, ordered as follows: (i) not in employment,
(ii) temporary contract, including fixed-term contract ("contratto a tempo 
determinato"), parasubordinate contract ("contratto di collaborazione coordinata e 
continuativa"), internship contract ("contratto di stage"), interim contract ("lavoro 
interinale"), (iii) apprenticeship contract ("contratto di apprendistato"), (iv) 
permanent contract, that is the open-ended contract ("contratto a tempo 
indeterminato"). 

We apply the Gini test discussed above to the measurement of concentration in 
survival times within groups of workers, and as a way to compare the distribution of 
survival times across such groups. 

Analysis of the differences in survival times of job contracts has been performed 
among subgroups of workers, based on gender, educational level and geographical 
area. 


¹ The Compulsory Communication Data are used with the permission of the Ministry of Labour and
Social Policies thanks to the agreement between the Department of Economics and Social Sciences of 
Marche Polytechnic University and General Department for the Innovation Technology of the Ministry of 
Labour and Social Policies. The authors are grateful to Stefano Staffolani and Matteo Picchio for the data 
preparation. 
In particular, we have analysed differences both (i) in the length of the first job
contract and (ii) in the waiting time between the end of the first contract and the 
beginning of the second one. The results are summarised in Table 1 and illustrated in 
Figures 1 to 4. 


Figure 1: Male versus female. Left-hand side: Length of the first job. Right-hand side: Waiting time for
a new job.


A first analysis is aimed at determining whether there are gender differences in the
Italian labour market. Figure 1 and Table 1 reveal no significant
difference between young males and young females in the waiting time between the
end of the first contract and the beginning of a new one, while significant differences
emerge in the length of the first job contract, which is longer for males than for females.
We also test for a significant impact of the educational level on the
Italian labour market: Table 1 and Figure 2 show that tertiary education helps in
quickly finding a new job, while it seems less relevant for activating permanent
contracts. Focusing on the tertiary economic sector, a worker with
tertiary education finds a new job more quickly at the end of the first contract, but the
length of the first contract is shorter, in comparison to workers in the same
economic sector without tertiary education (see Table 1 and Figure 3).


Figure 2: Tertiary education versus non-tertiary education. Left-hand side: Length of the first job.
Right-hand side: Waiting time for a new job.
Table 1: P-values of Gini, Gray-Tsiatis (GT), Log Rank (LR) and Wilcoxon (W) tests for different
group comparisons.

Comparison                        Survival time               Gini     GT       LR       W
GENDER                            Length of the first job     0.0152   0.0051   0.4041   0.4997
(Male versus female)              Waiting time for new job    0.8366   0.7629   0.9687   0.9865
EDUCATION                         Length of the first job     9 9999   0.4646   0.0000   0.0000
(Tertiary versus non tertiary)    Waiting time for new job    9.9000   0.0000   0.0000   0.0000
EDUCATION IN TERTIARY SECTOR      Length of the first job     0.0000   0.0000   0.5114   0.0463
(Tertiary versus non tertiary)    Waiting time for new job    0.0000   0.0000   0.0000   0.0000
GEOGRAPHICAL AREA                 Length of the first job     0.8834   0.0000   0.0000   0.0000
(North versus South)              Waiting time for new job    9 9999   0.0000   0.0000   0.0000

Figure 3: Tertiary education versus non-tertiary education within the tertiary economic sector. Left-hand
side: Length of the first job. Right-hand side: Waiting time for a new job.
Finally, we compare the Italian macro areas (North, Center and South): no
statistically significant differences emerge between the North and Center of Italy (data
not shown), while differences emerge between the North (or Center) and the South of
Italy. Table 1 and Figure 4 reveal that the labour market in the North of Italy is
characterized by a higher percentage of permanent contracts and by a shorter waiting
time for the activation of the second contract, compared to the South of Italy.


Figure 4: North versus South of Italy. Left-hand side: Length of the first job. Right-hand side: Waiting 
time for a new job. 


4 Concluding remarks 


In this paper we have examined the Italian labour market dynamics from a novel 
point of view, based on the concentration analysis. 

The empirical analysis revealed no significant difference
between males and females in the waiting time between the end of the first contract
and the beginning of a new one. Gender differences emerge, instead, in the length of 
the first job contract, which appears to be significantly longer for males than for 
females. 

Significant differences emerge also among geographical areas: the North of Italy 
has the highest percentage of permanent contracts and also the shortest waiting time 
for the second contract. 

Finally, different levels of education have a different impact on the Italian labour
market: tertiary education helps in quickly finding a new job, while it seems less
relevant for activating permanent contracts.
On the use of DISTATIS to handle multiplex
networks




Giuseppe Giordano, Giancarlo Ragozini and Maria Prosperina Vitale 


Abstract Multiplex networks arise when there exists more than one source of rela- 
tionships for a common set of nodes. For such data, the usual approaches consist of 
dealing with each relationship separately or of merging the information in a unique 
network. In the present contribution, we propose using factorial methods to visu- 
ally explore the complex structure in multiplex networks. Specifically, the derived 
adjacency matrices from one-mode multiplex networks are analyzed using the DIS- 
TATIS technique, an extension of multidimensional scaling to three-way data. This 
technique allows the representation of the different types of relationships in separate 
spaces for each layer and in a compromise space. How the analytic procedure works 
is illustrated using a real-world example. 

1 Introduction


Multilayer network data arise when there exists more than one source of relation- 
ships for a common set or different sets of nodes. For instance, in social networks, 
one can consider several types of relationships of different actors: friendship, neigh- 
bors, kinship, membership, etc. A multiplex network is a special case of a multilayer 
network [4] that consists of a fixed set of nodes that interacts through different types 
of relationships. For this kind of data, the usual approaches consist of dealing with 
multiple relationships separately or of flattening the information embedded in all 
layers. The latter reduces the complexity of multiplex data and may lead to a loss 
of relevant information. To cope with this issue, it could be useful to propose ana- 
lytic tools that can be used to adapt multivariate methods to network data [2]. In this 
regard, factorial methods have been proposed in the social network analysis (SNA) 
framework to explore different network structures [see, e.g., 3, 6, 10, 12], including 
attributes of nodes and events [7], or to analyze network-derived measures [9]. In the 
case of multiplex networks, canonical correlation analysis was adopted to identify 
dimensions along which networks are related to each other [2], and an analytical 
procedure was recently introduced for dimension reduction using cluster analysis 
[14]. 

To this end, the present contribution aims at extending the use of factorial meth- 
ods to visually explore the hidden structure of multiplex networks preserving the 
inherent complexity. More specifically, we focus on one-mode networks, analyzing 
the corresponding set of adjacency matrices using the DISTATIS technique [1]. This 
represents an extension of the multidimensional scaling applied to a set of distance 
matrices derived on the same set of objects. It allows us to represent the different 
kinds of relationships (inter-structures) both in separate spaces and in a common 
space, called compromise. The proposed method enhances the visual exploration 
of: i) the network structure in terms of nodes’ similarity in each single layer, ii) the 
common structure of all layers, iii) the nodes’ variation across layers, and iv) the 
similarity among the structure of layers. 

The paper is organized as follows. In Section 2, the concepts and notations for 
multiplex networks and the analytic procedure to handle multiplex network data 
using the DISTATIS method are briefly presented. Section 3 discusses a real-world 
example along with the main results obtained using the proposed analytic procedure. 
Section 4 concludes with suggestions for future lines of research. 


2 Analyzing multiplex network data with DISTATIS 


Multilayer networks explicitly incorporate multiple kinds of interactions among 
nodes and constitute a natural environment to describe complex systems in which 
different sets of nodes, or the same set of nodes as in the multiplex networks, could 
be connected according to different kinds of relational motifs, with each layer rep- 
resenting a motif. In multilayer networks, it is possible to observe two sets of links: 
i) the intra-layer connections, that is, the edges that remain inside each layer, and ii)
the inter-layer connections, that is, the edges that cross the layers. 

More formally, a multilayer network $\mathcal{M}$ is a pair $(\mathcal{G}, \mathcal{E})$, with $\mathcal{G} = \{G_k\}_{k=1,\ldots,K}$
the collection of K networks, and $\mathcal{E} = \{E_{kh}\}_{k,h=1,\ldots,K}$ the collections of intra-layer
edges, $E_{kk} = E_k$, and of inter-layer edges $E_{kh}$. In each layer, $G_k = (V_k, E_k)$, with
$V_k = (v_{1k}, \ldots, v_{nk})$ being the set of n nodes of each network, and $E_k \subseteq V_k \times V_k$ being
the set of edges.

Let $\mathcal{M}$ be a multiplex network where the set of nodes is fixed, that is, $V_1 = V_2 =
\cdots = V_K = V$, and the inter-layer edges are constant, that is, $E_{kh} = \{(v, v); v \in V\}$,
$\forall k \ne h$ [8]; then we consider from the network $G_k \in \mathcal{G}$ the corresponding adjacency
matrix $A_k = (a_{ijk})$, with $a_{ijk} = 1$ if $(v_i, v_j) \in E_k$, and $a_{ijk} = 0$ otherwise, $\forall i \ne j$.
The set of the K adjacency matrices gives rise to a three-way relationship matrix
$\mathbb{A} = (A_1, \ldots, A_K)$ that can be analyzed using one of the statistical methods designed
for three-way data.

In the present contribution, we adopt the DISTATIS technique [1], that is, a gen- 
eralization of multidimensional scaling in the STATIS approach [5] designed to ex- 
plore a set of distance matrices. The different relationships can be considered as 
different facets of a common underlying relational structure (corresponding to the 
compromise). Indeed, the technique allows analyzing both the relational structure 
embedded in each single layer and the global relational structure derived as a lin- 
ear combination of the layers with data-driven weights. Therefore, it provides a rich 
set of analytical and graphical results that also favor the comparison of the global 
structure and the single-layer structures. 

In order to extend DISTATIS to the study of multiplex network data, a three-way
distance matrix $\mathbb{D} = (D_1, \ldots, D_K)$ is derived from the adjacency matrix $\mathbb{A}$, with
$d_{ijk} = geo_k(v_i, v_j)$ being the geodesic distance between the nodes $v_i$ and $v_j$ in the
layer k if $geo_k(v_i, v_j) < \infty$, that is, if the two nodes are reachable from each other;
in the case of isolated nodes, we set $d_{ijk} = 2 \max_{v_i, v_j} geo_k(v_i, v_j)$. Other measures of
distance or data transformations could be considered. For example, an alternative
way to proceed is to calculate the complement of the adjacency matrix $\mathbf{1} - \mathbb{A} - \mathbb{I}$,
with $a_{ij} = 1$ for pairs of non-adjacent nodes, and $a_{ij} = 0$ for adjacent nodes, and
where $\mathbf{1}$ is the all-ones three-way matrix. The effect of different distance measures
on the proposed approach should be further investigated.
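
The construction of the distance matrices can be sketched with a breadth-first search per layer. The Python code below is a minimal illustration for undirected, unweighted layers given as 0/1 adjacency matrices; the function names are assumptions, and pairs that cannot reach each other are set to twice the longest finite geodesic of the layer, as described above.

```python
from collections import deque


def geodesic_distances(adj):
    """Geodesic distance matrix of one layer from its 0/1 adjacency matrix."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            i = queue.popleft()
            for j in range(n):
                if adj[i][j] and dist[source][j] is None:
                    dist[source][j] = dist[source][i] + 1
                    queue.append(j)
    longest = max(d for row in dist for d in row if d is not None)
    # unreachable pairs get twice the longest finite geodesic of the layer
    return [[2 * longest if d is None else d for d in row] for row in dist]


def three_way_distances(layers):
    """One geodesic distance matrix per layer of the multiplex network."""
    return [geodesic_distances(adj) for adj in layers]


if __name__ == "__main__":
    # Hypothetical layer: a path 0-1-2 plus the isolated node 3.
    layer = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    for row in geodesic_distances(layer):
        print(row)
```

A layer with no edges at all would need separate handling, since its longest finite geodesic is zero.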

Here, we consider the matrix $\mathbb{D}$ of geodesic distances and the procedure described
in Abdi et al. [1], suitably adapted to handle multiplex networks. The
derived compromise matrix represents a weighted average of the distance matrices 
using a double system of weights: the vi coefficients express the relative impor- 
tance of each layer in terms of inertia; whereas the a; coefficients measure such 
importance with reference to the similarity among the layers. The analytical results 
obtained by this technique will allow representing both the nodes related to each 
layer in the common reference space and each layer as a single point in the space 
defined by the first eigenvectors of the between-distance similarity matrix, high- 
lighting the similarity of the layers in terms of the whole network structure. 
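
How the similarity among layers might be quantified from the distance matrices can be
illustrated with a short Python sketch. It is a simplified reading of the DISTATIS
steps, not the authors' code: each distance matrix is double-centred into a
cross-product matrix (assuming equal masses for the nodes, and assuming that the
distances enter as they are, whereas squaring them first is a common alternative in
classical scaling), and the RV coefficient is then computed for every pair of layers.
All function names are hypothetical.

```python
def double_center(dist):
    """Cross-product matrix S = -1/2 * C D C, with C = I - (1/n) 1 1'."""
    n = len(dist)
    row_mean = [sum(row) / n for row in dist]
    col_mean = [sum(dist[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (dist[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def inner(a, b):
    """Matrix inner product trace(A'B): the sum of elementwise products."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def rv_coefficient(s1, s2):
    """RV coefficient between two cross-product matrices."""
    return inner(s1, s2) / (inner(s1, s1) * inner(s2, s2)) ** 0.5


def rv_matrix(distance_layers):
    """Layer-by-layer matrix of RV coefficients."""
    cross = [double_center(d) for d in distance_layers]
    return [[rv_coefficient(a, b) for b in cross] for a in cross]
```

Because the RV coefficient is invariant to rescaling, a layer and a copy of it with
all distances doubled have an RV coefficient equal to 1, up to floating-point error.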


3 A real-world example


Many multilayer networks have been collected and used as examples to demonstrate the
usefulness of newly proposed methods [see, e.g., 4, 8, and references therein]. In
order to illustrate how the DISTATIS technique works in practice for the treatment of
multiplex networks, we consider a data set containing different kinds of online and
offline relationships between 61 employees (out of 142) of the Computer Science
Department at Aarhus University [AUCS data, 13], [4]. Five connections among
employees were considered: co-authoring a publication [co-author]; being friends
on Facebook [FB]; being involved in repeated leisure activities [leisure]; regularly
eating lunch together [lunch]; and working together [work]. All relationships are
undirected and unweighted. In addition, two attribute variables were measured for
each employee: research group and academic position [i.e., professor, postdoc
researcher, PhD student, and administrative staff].

Even if we observe each single dimension of collaboration among employees separately,
with the proposed approach we can derive a unifying dimension of the underpinning
concept as a whole. At the same time, every dimension (layer) tells us about local
phenomena that can be analyzed and described in terms of an actor’s position in
the network. Therefore, the following two research questions must be addressed: 1)
Are there groups based on the position in the networks and on relational
similarities? 2) How similar are the network structures achieved by the different
types of relationships?

Starting with the five adjacency matrices of AUCS data, some DISTATIS results
are summarized in Table 1. The RV coefficient matrix among layers, whose
coefficients measure the congruence between two matrices [11], shows
here the similarities between each pair of layers. The two layers lunch and work
present the highest value. The factor scores (F1, F2), the eigenvectors (V1, V2),
and the $\alpha$ weights (that are closely related to each other) indicate the importance
of lunch, co-authorship, and work relationships in defining the compromise
structure. In addition, the first two dimensions account for about 66% of the
inter-layer dissimilarity.


         co-auth      FB leisure   lunch    work |    F1     F2 |    V1     V2 |     α
co-auth     1.00    0.38    0.35    0.41    0.51 |  0.78  -0.11 |  0.50  -0.12 |  0.23
FB                  1.00    0.23    0.22    0.25 |  0.55  -0.72 |  0.35  -0.77 |  0.16
leisure                     1.00    0.32    0.27 |  0.59  -0.19 |  0.38  -0.21 |  0.17
lunch                               1.00    0.58 |  0.75   0.41 |  0.48   0.44 |  0.22
work                                        1.00 |  0.78   0.36 |  0.50   0.39 |  0.23


Table 1 DISTATIS results: RV coefficient matrix among layers, factor scores (F1, F2),
eigenvectors (V1, V2) for the five layers in the first two dimensions, and $\alpha$ weights
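
As a plausibility check on Table 1, the weights can be approximately recovered from
the published RV matrix alone. The sketch below assumes, following the usual DISTATIS
convention, that the weights are the first eigenvector of the RV matrix rescaled to
sum to one; the power-iteration routine and all variable names are illustrative, and
the input values are the two-decimal figures of Table 1, so agreement can only be
expected up to rounding.

```python
# RV coefficients among layers from Table 1 (upper triangle mirrored).
LAYERS = ["co-auth", "FB", "leisure", "lunch", "work"]
RV = [
    [1.00, 0.38, 0.35, 0.41, 0.51],
    [0.38, 1.00, 0.23, 0.22, 0.25],
    [0.35, 0.23, 1.00, 0.32, 0.27],
    [0.41, 0.22, 0.32, 1.00, 0.58],
    [0.51, 0.25, 0.27, 0.58, 1.00],
]


def first_eigenvector(matrix, iterations=200):
    """Dominant eigenvector by power iteration, scaled to unit length."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


v1 = first_eigenvector(RV)
# Assumed DISTATIS convention: weights = first eigenvector rescaled to sum to one.
alpha = [x / sum(v1) for x in v1]

for name, v, a in zip(LAYERS, v1, alpha):
    print(f"{name:8s} V1 = {v:.2f}  alpha = {a:.2f}")
```

Under these assumptions the unit-length eigenvector should be close to column V1 and
the rescaled weights close to the last column of Table 1, with discrepancies of the
order of the rounding in the published figures.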


Based on the representation of the actors in the DISTATIS compromise factorial
plane (Figure 1a), in which each employee is labelled by their academic position
and colored according to the research group they belong to, some groups emerge
clearly. They are mostly consistent with the research group membership, even if
some actors bridge different groups. The groups are instead mixed with respect
to academic position. The factorial map in Figure 1b shows the role played by
each layer in determining the final compromise space. Whereas every layer has an
important (and positive) role in weighting the final configuration (the first
component can be read as a size-effect component), it is the second axis that reveals
the real shape of our configuration. At the top, the two prominent layers (lunch
and work) lie close to each other; the co-authorship and leisure layers are located
in the middle, and the Facebook layer lies opposite, at the bottom.


(a) DISTATIS compromise space (b) DISTATIS layers loadings 


Fig. 1 DISTATIS representation of actors and layers in the compromise space of AUCS data 


4 Concluding remarks 


In this work, we proposed an analytic method to treat multiplex network data based
on factorial techniques. The results of the illustrative example indicate the high
explanatory power of the DISTATIS technique in capturing similarities among
relationships. The possibility of measuring the dissimilarity between layers allows
the definition of a suitable subspace where comparisons at both the layer and node
levels can be made.

In conclusion, we provide some suggestions for future lines of research. The
analytic results of the adopted approach are useful for the substantive interpretation
of multiplex relationships. These findings could also be used to compute new measures
for multiplex network data. Moreover, as network data allow for several ways of
computing distances, a comparison of how different distance measures affect the
results and the visualization of the compromise space should be addressed. The
analyzed real-world example considers dichotomous, undirected, one-mode networks, and
the attribute data have been exploited only to enrich the interpretation. Extensions
of the method to deal with directed one-mode networks and two-mode networks
could be of interest, as could the inclusion of attribute data in defining the analytic
procedure.


Acknowledgements The authors would like to thank Matteo Magnani (Uppsala University,
Sweden) for making the data available.


Profiles of students on account of complex
problem solving (CPS) strategies exploited via
log-data


Profili di studenti basati sulle strategie di problem solving 
complesso esplorate attraverso log-data 


Michela Gnaldi, Silvia Bacci, Samuel Greiff and Thiemo Kunze 


Abstract This paper aims at identifying profiles of students that are homogeneous
with regard to their ability to solve Complex Problem Solving (CPS) tasks, as
assessed by the MicroDYN approach, a computer test made of 9 independent tasks
and administered to a sample of 6th and 9th grade Finnish students. To this aim,
we estimate a discrete two-tier Item Response Theory (IRT) model. Results indicate
that: (1) the conceptualisation of CPS as a three-dimensional variable is reasonable
and (2) there are seven latent classes of students, each characterised by a specific
profile with regard to the adopted CPS strategies, with students clustered in the
higher latent classes generally having a higher CPS ability than the others across
the three CPS dimensions.

Abstract The aim of this paper is to identify homogeneous profiles of students with
respect to their ability to solve complex problems, assessed with the MicroDYN test,
a computer test composed of 9 independent tasks and administered to a sample of
Finnish students in the sixth and ninth grades. To this end, we estimate a
multidimensional latent class Item Response Theory (IRT) model [2]. The results
indicate that: (1) it is reasonable to conceptualise CPS as a three-dimensional
variable and (2) seven latent classes of students can be observed, each characterised
by a specific profile in terms of the strategy adopted to solve complex tasks, with
students grouped in the higher latent classes generally showing greater ability than
the others on each of the three CPS dimensions.


1 Introduction


Complex Problem Solving (CPS) can be conceptualized in terms of a multidimen- 
sional latent variable. Buchner [5] defines CPS as “the successful interaction with 
task environments that are dynamic (i.e., change as a function of user’s intervention 
and/or as a function of time) and in which some, if not all, of the environment’s 
regularities can only be revealed by successful exploration and integration of the 
information gained in that process”. Key aspects that characterize CPS, and also
differentiate it from reasoning, are then: (i) dynamicity, as dynamic interactions
are necessary to reveal previously unknown information and to achieve
goals using subsequent steps that depend upon each other, (ii) not all information
necessary to solve the problem is given at the outset, (iii) the testee has to apply
adequate strategies in order to actively generate information, and (iv) procedural
abilities have to be used in order to control a given system.

When interacting with a computer test to solve CPS tasks, students produce log-files,
that is, finely grained data containing rich information on every single behavioral
action they undertake. These pieces of information provide researchers, teachers,
and policy makers with important insights into students’ proficiency and into
how to support them in optimizing their cognitive potential [13]. The question of
how log-files can be analysed to understand students’ levels of proficiency became
central in 2012, when the computer-based assessment of complex problem solving
was included in the Programme for International Student Assessment (PISA).
Despite that, the exploitation of this rich resource through log-file analyses is
still in its infancy [10].

In this contribution, we analyse log-data drawn from the MicroDYN, a computer
test to assess CPS ability, administered to a sample of 6th and 9th grade Finnish
students. The log-data at issue have been subsequently transformed into three types
of dichotomous items, each type reflecting one of the three underlying dimensions of
CPS, over nine different tasks. With these data at hand, we aim at identifying
profiles of students that are homogeneous with regard to their ability in CPS, which
can be conceptualized in terms of a multidimensional latent variable (for a similar
applicative work see for instance [8]). To this aim, we have to account in a suitable
way for: (i) the different dimensions that characterize CPS, that is, exploration
behaviour, knowledge acquisition, and knowledge application, and (ii) the
discrimination power and the difficulty of the items that measure CPS, so as to
evaluate the capacity of each item to distinguish between individuals with different
levels of CPS ability. To pursue these objectives, we estimate a discrete two-tier
Item Response Theory (IRT) model [2], allowing us to (i) characterise the latent
classes of testees through the support point estimates over the accounted CPS
dimensions and (ii) cluster testees in the latent classes according to the maximum
posterior class membership probabilities.

This paper is organised as follows. In Section 2 we describe the data used, in 
Section 3 the model and in Section 4 we provide results and a brief discussion of 
them. 


2 The data 


The MicroDYN test is a computer test to assess CPS ability. The test is organized in 
9 independent tasks (named Lemonade, Drawing, Cat, Moped, Game, Gardening, 
Handball, Spaceship, and Aid, in line with the cover stories used for the different 
tasks), whose characteristics are varied in order to produce items across a broad 
range of difficulty [9]; each task lasts about five minutes, for a total testing time of 
less than one hour. 

When working on MicroDYN, participants face three different aspects directly 
related to the three facets of problem solving ability [11]: 


• exploration behaviour, denoting the use of adequate strategies;
• CPS knowledge acquisition, denoting the knowledge generated;
• CPS knowledge application, denoting the ability to control the system.


For any of the tasks of the MicroDYN, during knowledge acquisition the testee
has to discover the relations between input variables, which can be manipulated, and
output variables. Then, the testee has to draw his/her mental model on the screen.
The match of this drawing with the real underlying model is the MicroDYN score
for knowledge acquisition. After the model is drawn, the testee can click on a finish
button, which leads to the knowledge application phase, in which the whole complex
system is reset and the correct model is provided to the students to avoid
empirical dependency between knowledge acquisition and knowledge application.
During knowledge application, the testee tries to reach given target values
(indicated in red) on the output variables by changing the input variables until the
maximum number of rounds is reached.

The MicroDYN adopts a Multiple-Item-Approach, in which multiple control 
rounds can be used in each task. In this approach: 


• although participants work on a series of independent tasks with different goals,
items in the test assessing knowledge acquisition gained during system exploration
are related to the same underlying dimension and depend on one another.
This is also the case for knowledge application items, such as using feedback
in order to adjust behavior. Thus, variables within each of the dimensions
(exploration behaviour, knowledge acquisition and knowledge application) are
dependent on each other;

• variables of the three kinds (exploration behaviour, knowledge acquisition, and
knowledge application) within each task are likely to be dependent on one another;

• additional knowledge that may be gained during knowledge application in
a task does not help participants in solving subsequent items in different tasks.


The MicroDYN was administered to a sample of 6th and 9th grade Finnish students. 
In each grade level, around 2,000 students were tested. 

Each of the nine tasks in the MicroDYN test is characterized by three types of 
binary items, for a total of 27 items, as follows: 


Items of type 1: a score of 1 was given when the testee applied the Vary One
Thing At a Time strategy (VOTAT; [15]) on all of the MicroDYN input variables
during the exploration phase in a certain task, and 0 if this was not done; these
items are affected by the exploration behaviour dimension of CPS;

Items of type 2: a score of 1 was given if the model drawn by the testee and related
to a certain task was completely correct, and 0 otherwise; these items are affected
by the knowledge acquisition facet;

Items of type 3: a score of 1 was given if the target areas of all variables in a
certain task were reached, and 0 if this was not done; these items are affected by the
knowledge application dimension.
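
To make the scoring rule for items of type 1 concrete, the following Python sketch
derives the binary VOTAT score from the exploration rounds of one task. The log
format is an assumption made for illustration (each round is a tuple of input
settings, with 0 meaning that an input was left untouched), as are the function
names; the actual MicroDYN log-files may be structured differently.

```python
def is_votat_round(inputs):
    """A round varies one thing at a time if exactly one input is non-zero."""
    return sum(1 for value in inputs if value != 0) == 1


def score_votat_item(exploration_rounds, n_inputs):
    """Return 1 if every input variable was varied in isolation at least once."""
    isolated = set()
    for inputs in exploration_rounds:
        if is_votat_round(inputs):
            # Record which input was the only one manipulated in this round.
            isolated.add(next(i for i, value in enumerate(inputs) if value != 0))
    return 1 if len(isolated) == n_inputs else 0


# Hypothetical exploration logs for a task with three input variables.
full_votat = [(1, 0, 0), (0, 2, 0), (0, 0, -1)]
partial_votat = [(1, 0, 0), (1, 1, 0), (0, 0, 1)]
print(score_votat_item(full_votat, 3), score_votat_item(partial_votat, 3))
```

In the second hypothetical log the second input is never manipulated on its own, so
the item would be scored 0 under this reading of the rule.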


3 The model 


We aim at investigating the above mentioned objectives through the estimation of 
a discrete two-tier Item Response Theory (IRT) model [2]. The model at issue is 
characterized by two independent vectors of latent variables that are measured on 
each student i (i = 1,...,n) through the responses on J = 27 items related to 9 
different tasks: 


• latent variables $U_i = \{U_{id_1}\}$ with $d_1 = 1, \ldots, D_1$: in our case, $D_1 = 3$ with
$U_{i1}$ denoting exploration behaviour, $U_{i2}$ denoting CPS knowledge acquisition, and
$U_{i3}$ denoting CPS knowledge application;

• latent variables $V_i = \{V_{id_2}\}$ with $d_2 = 1, \ldots, D_2$: in our case, $D_2 = 9$ with
each $V_{id_2}$ denoting a latent variable accounting for correlation among responses of
individual $i$ on items related to the same task $d_2$.


Latent variables in $U_i$ are assumed to have a discrete distribution, with a finite
number $k_1$ of support points $u_1, \ldots, u_{k_1}$ and corresponding mass probabilities (or
weights) $\lambda_1, \ldots, \lambda_{k_1}$. Each support point $u_{h_1} = (u_{h_1 1}, \ldots, u_{h_1 D_1})'$
($h_1 = 1, \ldots, k_1$) has a particularly nice interpretation, as it denotes an
unobservable cluster (or latent class) of individuals having a homogeneous profile in
terms of attitude toward CPS. Besides, each weight $\lambda_{h_1}$ denotes the latent class
membership probability, that is, $\lambda_{h_1} = p(U_i = u_{h_1})$. Similarly, latent variables
in $V_i$ have a discrete distribution with $k_2$ support points $v_1, \ldots, v_{k_2}$ and related
weights $\pi_1, \ldots, \pi_{k_2}$, with $\pi_{h_2} = p(V_i = v_{h_2})$ ($h_2 = 1, \ldots, k_2$).

Items of type 1, 2, and 3 are related to the latent variables $U_i$ and $V_i$ through
an IRT parameterization. In more detail, responses to items of type 1 depend on
latent variable $U_{i1}$, responses to items of type 2 depend on latent variable $U_{i2}$,
and responses to items of type 3 depend on latent variable $U_{i3}$. Moreover, each
latent variable $V_{id_2}$ is measured by the three types of items related to the same
task $d_2$ ($d_2 = 1, \ldots, 9$). These relations are formalized by introducing the disjoint
sets $\mathcal{U}_1, \ldots, \mathcal{U}_{D_1}$ and $\mathcal{V}_1, \ldots, \mathcal{V}_{D_2}$, where $\mathcal{U}_{d_1}$ contains the indices of
items affected by latent variable $U_{id_1}$ ($d_1 = 1, \ldots, D_1$) and, similarly, $\mathcal{V}_{d_2}$
contains the indices of items affected by latent variable $V_{id_2}$ ($d_2 = 1, \ldots, D_2$).
The detailed relations among latent variables and items are displayed in Figure 1,
where $Y_{ij}$ denotes the response of person $i$ to item $j$ ($i = 1, \ldots, n$, $j = 1, \ldots, J$).


Fig. 1 Path diagram of the discrete two-tier model with $D_1 = 3$ latent variables $U_i$,
$D_2 = 9$ latent variables $V_i$, and $J = 27$ items (three types of item for each task).


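As all items are dichotomously scored, we adopt a Two-Parameter Logistic (2-PL)
parameterization [4], as follows ($i = 1, \ldots, n$, $j = 1, \ldots, J$):


$$\mathrm{logit}\, p(Y_{ij} = 1 \mid U_i = u_{h_1}, V_i = v_{h_2}) = \gamma_{1j} \sum_{d_1=1}^{D_1} 1\{j \in \mathcal{U}_{d_1}\}\, u_{h_1 d_1} + \gamma_{2j} \sum_{d_2=1}^{D_2} 1\{j \in \mathcal{V}_{d_2}\}\, v_{h_2 d_2} - \beta_j, \qquad (1)$$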


where, as usual in the IRT parameterization, $\gamma_{1j}$ and $\gamma_{2j}$ are the discrimination
parameters related to variables in $U_i$ and $V_i$, respectively, and $\beta_j$ is the difficulty
parameter of item $j$; $1\{\cdot\}$ is the indicator function. The parameters of the model at
issue may be estimated by means of the maximum likelihood approach, efficiently
implemented through the Expectation-Maximization (EM) algorithm [6] in the R
package MLCIRTwithin [3]. It has to be noted that model identification requires
the specification of suitable constraints on the item parameters and/or the support
points; for details, see [2].

We also point out that the formulation of equation (1) is completely general and
allows for any number of components in $U_i$ (as well as in $V_i$). In more detail, in
the following we compare in terms of goodness-of-fit the proposed model, characterized
by $D_1 = 3$ latent variables $U_{i1}, U_{i2}, U_{i3}$, with the following nested alternatives:
(i) a model with $D_1 = 2$ elements in $U_i$, with $U_{i1}$ and $U_{i2}$ collapsed into a single
dimension; (ii) a model with just $D_1 = 1$ element in $U_i$, with all items affected by
only one latent variable ($U_{i1}$, $U_{i2}$, $U_{i3}$ collapsed).
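
Equation (1) can be evaluated numerically with a few lines of Python. The sketch
below is illustrative only: because the item sets are disjoint, each indicator sum
selects exactly one component of the support points, which is encoded here through
two hypothetical lookup lists (dim_of_item, task_of_item); the parameter values in
the example call are made up and are not estimates from the MicroDYN data.

```python
import math


def response_probability(j, u, v, gamma1, gamma2, beta, dim_of_item, task_of_item):
    """P(Y_ij = 1 | U_i = u, V_i = v) under the 2-PL two-tier parameterization.

    u, v          : support points of the two latent vectors (lists of floats)
    gamma1, gamma2: discrimination parameters, one per item
    beta          : difficulty parameters, one per item
    dim_of_item   : index d_1 of the CPS dimension affecting item j
    task_of_item  : index d_2 of the task to which item j belongs
    """
    logit = (gamma1[j] * u[dim_of_item[j]]
             + gamma2[j] * v[task_of_item[j]]
             - beta[j])
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values for two items of the same task.
p = response_probability(
    j=1, u=[0.5, 1.0, 0.0], v=[2.0],
    gamma1=[1.0, 1.2], gamma2=[0.0, 0.5], beta=[0.5, -0.3],
    dim_of_item=[0, 1], task_of_item=[0, 0],
)
print(round(p, 3))
```

Collapsing dimensions, as in the nested alternatives, would amount in this sketch to
mapping several item types to the same entry of dim_of_item.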

It is also worth noting that the numbers of support points of $U_i$ and $V_i$, that
is, $k_1$ and $k_2$, are not model parameters, but have to be fixed a priori.
To this aim, following the mainstream of the literature (see mainly [12]), we base
our choice on the Bayesian Information Criterion (BIC; [14]) and on Akaike’s
Information Criterion (AIC; [1]). In both cases, the smaller the index, the better
the model. In practice, we fix $k_2 = 2$, as we are not particularly interested in
clustering individuals according to the latent variable $V_i$, and we estimate the model
in (1) for increasing values of $k_1$ until the index starts to increase. Then, we select
the previous value of $k_1$ as the optimal number of latent classes, which guarantees
the best compromise between goodness-of-fit and model parsimony.
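
The selection rule just described can be written compactly. The Python sketch below
uses the standard definitions BIC = -2 log L + p log n and AIC = -2 log L + 2p, with
p the number of free parameters and n the sample size, and then scans increasing
values of k_1 until the index stops decreasing. Function names and the numbers in
the usage example are hypothetical and are not results from the paper.

```python
import math


def bic(log_likelihood, n_parameters, n_units):
    """Bayesian Information Criterion: smaller is better."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_units)


def aic(log_likelihood, n_parameters):
    """Akaike's Information Criterion: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


def select_k1(index_by_k):
    """Return the last k_1 before the index first stops decreasing.

    index_by_k is a list of (k_1, index value) pairs sorted by increasing k_1.
    """
    best_k, best_value = index_by_k[0]
    for k, value in index_by_k[1:]:
        if value >= best_value:
            break  # the index started to increase: keep the previous k_1
        best_k, best_value = k, value
    return best_k


# Hypothetical index values for k_1 = 1, ..., 5.
print(select_k1([(1, 105.0), (2, 100.0), (3, 98.5), (4, 99.0), (5, 97.0)]))
```

Note that this rule stops at the first minimum met while increasing k_1; in the
hypothetical sequence above it would not reach the lower value at k_1 = 5.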

To summarize, as main results of the model estimation we obtain: 


• difficulty and discrimination parameters for each item, having the usual
interpretation as in the traditional IRT models,

• support points and weights for the latent classes,

• posterior class membership probabilities for each individual, which are used to
cluster individuals in the latent classes according to a suitable criterion (usually,
the maximum a posteriori one).
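
The last output in the list, the posterior class membership probabilities, follows
from Bayes' rule. A minimal Python sketch, assuming that the class weights and the
likelihood of an individual's response pattern under each latent class are already
available (the function names and the numbers below are illustrative):

```python
def posterior_probabilities(weights, likelihoods):
    """Posterior of each class: proportional to weight_h * P(y_i | class h)."""
    joint = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(joint)
    return [x / total for x in joint]


def map_class(weights, likelihoods):
    """Index of the class with the maximum a posteriori probability."""
    post = posterior_probabilities(weights, likelihoods)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical weights and class-conditional likelihoods for one individual.
weights = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.05, 0.02]
print(posterior_probabilities(weights, likelihoods), map_class(weights, likelihoods))
```

In this made-up example the second class has the largest posterior probability even
though the first class has the largest prior weight.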


4 Application and discussion of results 


As stated in the previous section, within the model at issue, the latent variables
in $U_i$ are assumed to have a discrete distribution, with a finite number of latent
classes, $k_1$. Thus, the first step of our application consists in selecting the number
of latent classes $k_1$, that is, the number of classes of units (i.e., students) in our
data. We make this selection by assuming 3 distinct dimensions, $D_1 = 3$, that is,
$U_{i1}, U_{i2}, U_{i3}$. To this aim, we compute the BIC index for an increasing number of
latent classes, and we select as the optimal number the one at which BIC reaches its
first minimum. In our case, as the minimum value of BIC is observed for the model
with $k_1 = 7$ (i.e., 59236.58), we select this number of latent classes. By contrast,
for the dimensions in $V_i$, which account for the possible correlation of the 3 items
in each task, we assume for simplicity a fixed number of latent classes, that is,
$k_2 = 2$.

Afterwards, we check the initial assumption of 3 distinct dimensions underlying CPS
by comparing through BIC and AIC the proposed model, characterized by
$D_1 = 3$ latent variables ($U_{i1}, U_{i2}, U_{i3}$), with a model with $D_1 = 2$ elements in $U_i$,
with $U_{i1}$ and $U_{i2}$ collapsed into a single dimension (i.e., exploration is directly
transferred into knowledge acquisition), and with a model with $D_1 = 1$ element in
$U_i$, with $U_{i1}$, $U_{i2}$, $U_{i3}$ collapsed into a single latent variable (i.e., CPS is a
unidimensional latent variable). We select as the best model the one for which we
observe the minimum value of BIC and AIC, that is, the three-dimensional model, as
shown in Table 1. This choice implies that it is reasonable to assume that CPS is a
multidimensional variable made of three distinct facets, that is, exploration
behaviour, knowledge acquisition, and knowledge application.


Table 1 Estimated BIC and AIC for the three models with $D_1 = 3$ (CPS is a
three-dimensional latent variable), $D_1 = 2$ (CPS is a two-dimensional latent variable),
and $D_1 = 1$ (CPS is a uni-dimensional latent variable). The smallest values are
obtained for $D_1 = 3$.

        D_1 = 3    D_1 = 2    D_1 = 1

BIC    59236.58   59347.67   59478.31
AIC    58708.66   58850.09   59011.07


The key step of our application consists in estimating the support points
($u_1, \ldots, u_{k_1}$) and corresponding weights ($\lambda_1, \ldots, \lambda_{k_1}$), denoting the class
membership probabilities, given the $D_1 = 3$ dimensions underlying CPS, as shown in
Table 2.


Table 2 Support points for the seven latent classes ($h_1 = 1, \ldots, 7$) and the three
dimensions ($U_{i1}$, $U_{i2}$, $U_{i3}$) of CPS.


           h_1 = 1   h_1 = 2   h_1 = 3   h_1 = 4   h_1 = 5   h_1 = 6   h_1 = 7

U_i1       -1.5000   -1.0000   -0.5000    0.0000    0.5000    1.0000    1.5000
U_i2       -0.2228   -0.6981    0.7389    1.1304    1.5816    2.1438    4.9642
U_i3        1.7671   -0.7089    1.0431    3.1116    1.8435    3.6384    4.7766

λ_h1        0.3000    0.1849    0.0810    0.0837    0.1714    0.1481    0.0309
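
As a small check on Table 2, the Python sketch below verifies that the estimated
weights sum to one and computes, for each dimension, the weighted mean of the support
points. Using this weighted mean as the reference level against which a single
support point is judged is an assumption made here for illustration; the variable
names are likewise illustrative.

```python
# Weights and support points transcribed from Table 2.
weights = [0.3000, 0.1849, 0.0810, 0.0837, 0.1714, 0.1481, 0.0309]
support = {
    "U1": [-1.5000, -1.0000, -0.5000, 0.0000, 0.5000, 1.0000, 1.5000],
    "U2": [-0.2228, -0.6981, 0.7389, 1.1304, 1.5816, 2.1438, 4.9642],
    "U3": [1.7671, -0.7089, 1.0431, 3.1116, 1.8435, 3.6384, 4.7766],
}

total_weight = sum(weights)

# Weighted mean of the support points on each dimension.
means = {dim: sum(w * u for w, u in zip(weights, points))
         for dim, points in support.items()}

print(f"sum of weights = {total_weight:.4f}")
for dim, value in means.items():
    print(f"weighted mean of {dim} = {value:.4f}")
```

On the third dimension the weighted mean computed in this way lies very close to the
support point of the first latent class (1.7671).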


Support points provide rich information: the higher the support points, the
higher the ability of the testees clustered in each class to solve the complex
tasks involved in the three dimensions underlying CPS. It can be noticed that the
support points tend to increase when moving from the lowest latent class (i.e.,
$h_1 = 1$) to the highest (i.e., $h_1 = 7$), meaning that the testees clustered in the
highest latent classes have a higher ability than the others over all three dimensions.
However, there are also support points that do not follow this general pattern.
For example, let us consider the first two latent classes ($h_1 = 1$ and $h_1 = 2$). In
the first latent class ($h_1 = 1$) there are students who are the poorest as regards
$U_{i1}$, but (i) not the poorest as regards $U_{i2}$ (i.e., -0.2228 is the second lowest
value) and (ii) close to average as regards $U_{i3}$. In the second latent class
($h_1 = 2$) there are students who are the second poorest as regards $U_{i1}$, and the
poorest as regards $U_{i2}$ and $U_{i3}$. Overall, these analyses allow us to profile testees
on account of the adopted strategies of CPS.


Characterising Italian municipalities according
to the annual report of the
prevention-of-corruption supervisor: a Latent
Class approach.


Caratterizzazione dei comuni italiani sulla base delle 
relazioni annuali dei Responsabili Prevenzione 
Corruzione: un approccio a classi latenti. 


Michela Gnaldi and Simone Del Sarto 


Abstract This work aims at characterising Italian municipalities according to what
has been accomplished in terms of corruption prevention. The recent “anti-corruption
law” of 2012 establishes a new plan for corruption prevention. It introduces a new
figure, the prevention-of-corruption supervisor, who reports whether and how
preventive measures are implemented within the public institution he/she represents
by filling in a standardised form, which has to be published on the institution’s
website. We rely on these data, downloaded from each single municipality website, to
apply a Latent Class model allowing us to identify groups of municipalities with
similar behaviour. Further, we qualify such classes on account of several covariates.
First results show that (i) there is a general tendency among municipalities to comply
with the prevention-of-corruption law and (ii) virtuous municipalities are large
municipalities that have experienced at least one corruption event.


Abstract The aim of this work is to characterise Italian municipalities according
to what has been achieved in terms of corruption prevention. The recent
“anti-corruption” law of 2012 establishes a new programme for corruption prevention
and introduces the new figure of the prevention-of-corruption supervisor. By means of
a standardised form, the supervisor reports how corruption prevention measures have
been implemented within the public administration he/she represents. The data
collected from these forms, downloaded individually from the institutional websites
of the sampled municipalities, are analysed through a Latent Class approach, which
makes it possible to identify groups of municipalities with similar behaviour.
Moreover, these classes are qualified on the basis of some covariates. First results
show (i) a general tendency among municipalities to comply with the law on corruption
prevention and (ii) that the most virtuous municipalities are large municipalities
that have experienced corruption events.


Key words: prevention of corruption, Latent Class model 


1 Introduction 


Worldwide, corruption is a plague affecting the economic, social and institutional
development of a country, and political institutions have attempted to implement
measures to counter and contain this phenomenon [1]. Corruption is latent by
nature. People involved in corrupt activities try to hide or falsify information and,
as a consequence, its measurement is very complex [4]. However, there are measures
of corruption, the so-called “perception-based” and “non-perceptual” measures. The
former are based on the subjective perception of corruption provided by experts and
are generally expressed at country level: an example is the well-known Corruption
Perceptions Index (CPI). The latter are objective indexes based on proxies, i.e.,
market or statistical indicators tied to corruption, such as the price of inputs
purchased by a public administration, the rate of criminal convictions of public
officials for crimes related to corruption, and so on. The pros and cons of these two
types of corruption measurement tools are commonly known, but these details are
beyond the scope of this work.

In Italy, with the recent law n.190 of 2012, named “Provisions for the prevention
and repression of corruption and lawlessness in the public administration”, each
public institution has to adopt a three-year plan for corruption prevention (“Piano
Triennale per la Prevenzione della Corruzione”, PTPC), which provides an assessment
of the different levels of exposure of offices to the risk of corruption and specifies
the organisational changes designed to prevent such risk. To this general aim, each
institution selects a supervisor, called the prevention-of-corruption supervisor
(“Responsabile per la Prevenzione della Corruzione”, RPC). Among his/her tasks,
the supervisor has to fill in an annual report about the efficacy of the prevention
measures defined by the PTPC. This report is filled in through a questionnaire, made
available in spreadsheet format by the Italian National Anti-Corruption Authority
(ANAC), and has to be uploaded to the “Transparent administration” section of the
public institution’s website.

In this work, we aim at classifying and qualifying a sample of municipalities
according to what is stated in the RPC form. Since it summarises what each institution
has accomplished to prevent and counter corruption, according to the PTPC, our
purpose is to cluster municipalities in homogeneous groups as regards the adopted
anti-corruption measures. To this aim, we rely on the Latent Class (LC) model [2, 3],
which allows us to cluster units with similar behaviour on account of a latent and
unobserved characteristic (i.e., corruption).

The remainder of this paper is organised as follows. In Section 2 the RPC form
data considered here are briefly described, while the LC model is introduced in
Section 3. Results are shown in Section 4 and some concluding remarks are given
in Section 5.


2 The RPC form data


The data we consider in this work were collected through the RPC forms filled in in
2015. Each of them was downloaded individually from the “transparent
administration” section of the municipality’s institutional website. Overall, our
sample consists of 232 municipalities, comprising all Italian province municipalities,
all the other municipalities with at least 40,000 inhabitants and particular “advised”
municipalities, as stated by the ANAC act n.71 of 2013.

The RPC form has several sections reflecting different aspects of the efficacy
of prevention measures, defined in the PTPC adopted by each institution. In this
work we consider the responses to the opening questions of each section, requiring
the institution at issue to state whether it has accomplished the required activities.
The questions are related to the following contents (the original label within the
RPC form is reported in square brackets):


1. monitoring the sustainability of all measures, general and specific, identified in
the PTPC [2A];
2. specific measures, in addition to mandatory ones [3A];
3. computerising the flow that feeds data publication in the “transparent
administration” website section [4A];
4. monitoring data publication processes [4C];
5. training of employees, specifically dedicated to prevention of corruption [5A];
6. staff turnover as a risk prevention measure [6B];
7. checking the truthfulness of statements made by parties concerned with unfitness
for office causes [7A];
8. measures to verify the existence of incompatibility conditions [8A];
9. prearranged procedures for issuing permits for assignments performance [9A];
10. collecting reports of misconduct by public administration employees
(whistleblowing) [10A].


Three possible answers can be provided to these questions: “Yes” (labelled as 1), 
“No, but expected by the PTPC” (2) and “No, not expected by the PTPC” (3). As 
can be noted, the second response is the least virtuous answer and the other two 
correspond to actions in line with the PTPC. 


3 The Latent Class model 


The Latent Class (LC) model is one of the most well-known latent variable models. 
It is often used to classify units of a sample in homogeneous groups, according to 
a set of categorical variables (e.g., the responses to questionnaire items). Such vari- 
ables represent the observable manifestation of a unique underlying latent variable. 

Considering a sample of n units, let $Y_i = (Y_{i1},\dots,Y_{iJ})'$ denote the random vector
of the responses provided by unit i to the J items of a questionnaire, with $i = 1,\dots,n$.
Each $Y_{ij}$ is a categorical variable with $l$ categories, generally labelled from
0 to $l-1$. In this paper, we consider $l = 3$ since three different response modalities
can be provided to the selected RPC form items (see Section 2). However, we retain
the original response labels (1,2,3).

The LC model assumes the existence of one discrete latent variable $C_i$ with the
same distribution for each unit i. This latent variable is based on k support points.
Each point has a specific prior probability, denoted by $\pi_c$, $c = 1,\dots,k$, and
corresponds to a latent class in the population. Furthermore, the conditional probability
that unit i, belonging to class c, provides response y to item j is:

$$\phi_{j|c}(y) = P(Y_{ij} = y \mid C_i = c), \quad j = 1,\dots,J, \quad y = 0,\dots,l-1, \quad c = 1,\dots,k.$$


Moreover, local independence between the response variables $Y_{ij}$ is assumed:
this hypothesis states that the response variables are conditionally independent given
the latent class. This implies that the probability of observing the response vector
$y_i = (y_{i1},\dots,y_{iJ})'$, given that unit i is in latent class c, can be formulated
as the product of each conditional probability reported above, over the J items.
Specifically, we have:

$$P(y_i \mid c) = P(Y_i = y_i \mid C_i = c) = \prod_{j=1}^{J} \phi_{j|c}(y_{ij}).$$


Then, the manifest probability of $y_i$ can be obtained as follows:

$$P(y_i) = P(Y_i = y_i) = \sum_{c=1}^{k} P(y_i \mid c)\,\pi_c.$$


It is often of interest to rely on an allocation rule that assigns each sample
unit to a particular latent class, given its response pattern. Such a procedure is based
on the posterior probability that unit i belongs to class c, given the response vector
$y_i$. It can be obtained using Bayes' theorem, as follows:

$$P(c \mid y_i) = P(C_i = c \mid Y_i = y_i) = \frac{P(y_i \mid c)\,\pi_c}{P(y_i)}, \quad c = 1,\dots,k. \qquad (1)$$

In particular, each unit is assigned to the latent class for which its posterior
probability is largest.
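
The allocation rule can be illustrated with a minimal sketch. It assumes NumPy, responses coded 0,...,l-1, and purely hypothetical parameter values for k = 2 classes, J = 2 items and l = 3 categories (they are not the estimates of this paper); the function name `lc_posterior` is likewise an illustrative choice.

```python
import numpy as np

def lc_posterior(y, pi, phi):
    """Posterior class probabilities P(c | y) for one response vector.

    y   : length-J sequence of responses coded 0..l-1
    pi  : length-k array of prior class probabilities (sums to 1)
    phi : array of shape (k, J, l); phi[c, j, y] = P(Y_ij = y | C_i = c)
    """
    y = np.asarray(y)
    items = np.arange(len(y))
    # Local independence: P(y | c) is the product over items of phi[c, j, y_j].
    cond = np.prod(phi[:, items, y], axis=1)
    joint = cond * pi            # P(y | c) * pi_c
    manifest = joint.sum()       # manifest probability P(y)
    return joint / manifest, manifest

# Hypothetical parameters, for illustration only.
pi = np.array([0.5, 0.5])
phi = np.array([
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],   # class 1
    [[0.2, 0.3, 0.5], [0.1, 0.3, 0.6]],   # class 2
])

posterior, manifest = lc_posterior([0, 0], pi, phi)
assigned_class = int(np.argmax(posterior)) + 1   # classes labelled 1..k
print(posterior, manifest, assigned_class)
```

The unit is allocated to the class with the largest entry of `posterior`, as in equation (1).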


4 Results 


Relying on the BIC index, we identify k = 2 latent classes, for which the following
prior probability estimates are obtained: $\hat\pi_1 = 0.482$ and $\hat\pi_2 = 0.518$.
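
For reference, a sketch of the criterion, assuming the usual definition of BIC (the text does not spell it out) and the standard count of free parameters of an LC model with k classes, J items and l response categories; $\hat\ell_k$ denotes the maximised log-likelihood:

```latex
\mathrm{BIC}_k = -2\,\hat{\ell}_k + \#\mathrm{par}_k \,\log n,
\qquad
\#\mathrm{par}_k = (k-1) + k\,J\,(l-1).
```

With J = 10 and l = 3 this gives $\#\mathrm{par}_k = 21k - 1$, that is 41 free parameters for k = 2, and the number of classes is chosen as the k with the smallest $\mathrm{BIC}_k$.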

In Table 1, the estimated conditional response probabilities $\hat\phi_{j|c}(y)$ are reported
for each item j and for each response category y (1,2,3). The conditional probabilities
of a positive response (y = 1) are often high (> 60%) for both latent classes, meaning
that this answer was very frequent in our sample for almost all ten items. However,
units belonging to the first latent class (c = 1) have larger conditional probabilities
of answering “Yes” than units in the second (c = 2), since we observe
$\hat\phi_{j|1}(1) > \hat\phi_{j|2}(1)$ for j = 1,...,10, even if the difference between these
probabilities is sometimes small (see items 4 and 5). Conversely, units in the second
latent class generally exhibit a lower probability of affirmative responses and higher
probabilities for the second and third response categories (y = 2 and y = 3). Thus,
units in this group have a higher probability of not accomplishing the activities listed
in the form than units in the first class, mainly because such activities are not expected
by the PTPC, since $\hat\phi_{j|2}(3) > \hat\phi_{j|2}(2)$ for all items except item 5. Overall,
looking at Table 1, it can be stated that the two subpopulations differ especially as
regards the outcomes of items 7 and 8, since units in the first latent class answer “Yes”
to these two items with a high probability, while for those in the second latent class
the most likely response is the third.

Table 1 Estimated conditional response probabilities $\hat\phi_{j|c}(y)$ for the selected ten
items of the RPC form.

 j  c   y=1    y=2    y=3   |   j  c   y=1    y=2    y=3
 1  1  0.875  0.036  0.089  |   6  1  0.549  0.185  0.266
    2  0.660  0.059  0.281  |      2  0.387  0.204  0.409
 2  1  0.815  0.000  0.185  |   7  1  0.724  0.042  0.234
    2  0.615  0.075  0.311  |      2  0.077  0.187  0.736
 3  1  0.811  0.055  0.134  |   8  1  0.817  0.001  0.181
    2  0.582  0.069  0.348  |      2  0.031  0.187  0.782
 4  1  0.909  0.000  0.091  |   9  1  0.906  0.045  0.049
    2  0.886  0.052  0.062  |      2  0.804  0.033  0.163
 5  1  0.926  0.051  0.023  |  10  1  0.850  0.092  0.058
    2  0.860  0.119  0.020  |      2  0.645  0.133  0.222

Finally, according to the posterior probabilities computed using equation (1),
each unit is assigned to one of the two latent classes. This classification is
cross-tabulated with some variables characterising the sample units, in order to further
qualify the latent classes. Among others, two variables show an interesting relation:
the occurrence of a corruptive event and the population size. The former is directly
obtained from the RPC form (question 2B), while the latter is obtained by dividing the
municipalities into quartiles according to their resident population.

As reported in Table 2, even if only 33 municipalities out of 213 report at least one
corruptive event (15.5%), most of them (69.7%) belong to the first latent class, while,
among units with no events, the allocation between the latent classes is closer to 50%.
As far as the municipality population size is concerned, a trend in the latent class
partition is observable along the quartiles of the population. In particular, the first
quartile of municipalities (hence the least populated) is mostly characterised by units
belonging to the second latent class (65.5%), while the last quartile (the most
populated) has more units belonging to the first (56.9%).

Table 2 Cross-tables between latent class membership and occurrence of corruptive events
within the municipality (a) and size of municipality (b). Values in parentheses represent
row percentages.

(a)*
Events          Latent class 1   Latent class 2   Total
none              79 (43.9)       101 (56.1)       180
at least one      23 (69.7)        10 (30.3)        33
Total            102              111              213

(b)
Size            Latent class 1   Latent class 2   Total
1                 20 (34.5)        38 (65.5)        58
2                 27 (46.6)        31 (53.4)        58
3                 29 (50.0)        29 (50.0)        58
4                 33 (56.9)        25 (43.1)        58
Total            109              123              232

*: 19 units have a missing response on the occurrence of corruptive events.
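
The row percentages of panel (a) can be recomputed from the counts with a few lines of NumPy. This is only an arithmetic check on the published counts; the variable names are illustrative.

```python
import numpy as np

# Counts from Table 2(a): rows = events (none, at least one),
# columns = latent class (1, 2).
counts = np.array([[79, 101],
                   [23, 10]])

row_totals = counts.sum(axis=1, keepdims=True)
row_pct = 100 * counts / row_totals                        # row percentages
share_with_event = 100 * row_totals[1, 0] / counts.sum()   # 33 out of 213

print(np.round(row_pct, 1))
print(round(float(share_with_event), 1))
```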


5 Conclusions 


This work is the first attempt to extensively analyse the wealth of information included
in the annual reports filled in by the prevention-of-corruption supervisors.
Our purpose is to study the behaviour of a sample of Italian municipalities in adopting
measures to counter corruption. A Latent Class analysis is performed in order to
cluster municipalities into homogeneous groups, according to their behaviour as regards
corruption prevention. Two groups of municipalities emerge, the first of which
gathers the most virtuous municipalities.

Furthermore, two variables can be considered important in qualifying the two
latent classes: the occurrence of corruptive events within the institution and the
municipality population size. In particular, the most virtuous class is characterised by
the most populated municipalities and by those which experienced at least one
corruptive event.


A proposal of a discretization method applicable
to Rasch measures


Proposta di un metodo di discretizzazione applicabile alle 
misure ottenute tramite il modello di Rasch 


Silvia Golia 


Abstract The aim of this paper is to propose a discretization method which can be
applied to measures obtained using the Rasch model or, more generally, a model
belonging to the class of IRT models. The motivation for this proposal lies in
the fact that some methodologies work with discretized variables; Bayesian Networks
are one such example. The idea is to use the information from the
Rasch model to forecast the answer of a subject to a representative item;
this answer represents the category assigned to the subject in the categorized
version of his/her latent trait. In order to verify the goodness of this proposal, the
new discretized variable is compared with a global single-item measure, under the
hypothesis that this item is a possible observed discretization of the latent variable.
Abstract Lo scopo di questo lavoro é quello di proporre un metodo di discretiz- 
zazione applicabile a misure ottenute utilizzando il modello di Rasch o, più in gen- 
erale, modelli di tipo IRT. La ragione di questa necessità risiede nella constatazione 
che vi sono metodologie che richiedono variabili categorizzate; un esempio è dato 
dalle Reti Bayesiane. L’idea è di sfruttare le informazioni del modello di Rasch 
per prevedere la risposta di un soggetto ad un item rappresentativo, e tale risposta 
rappresenta la categoria assegnata al soggetto nelle versione categorizzata del suo 
aspetto latente. Per verificare la bontà della proposta, la nuova variabile discretiz- 
zata viene confrontata con le risposte ad un item globale, ipotizzando che questo 
rappresenti una possibile discretizzazione osservata della variabile latente. 


Key words: Discretization, Rasch measure, Global single-item


1 Introduction


Many statistical learning algorithms require only categorical variables or produce
better models if the variables involved in the analysis are not continuous. However,
many real databases include continuous as well as categorical variables, so it is
necessary to reduce these continuous variables to discretized ones. In socio-economic
and psychological contexts it is common to take into account latent variables, which
cannot be measured directly in the way that, for example, weight or height can. In order
to measure a latent trait, an ad hoc questionnaire is generally prepared and
administered to a sample of the target population. One approach that allows one to
estimate a measure of the latent trait is the so-called item response theory (IRT)
approach. The IRT approach is based on the idea that the probability of response in any
one of two or more mutually exclusive categories of an item is a function of the subject's
location on the latent continuum representing the latent trait of interest and of some
estimable parameters characteristic of the item. The problem addressed in this paper
is to find a good method to discretize continuous measures estimated by applying a
model that follows the IRT approach. The resulting discretized variable must reflect
the fact that higher scores correspond to higher levels of the latent trait, so the
ordering of the categories matters. The measurement model considered in this paper is
the Rating Scale Model (RSM) ([1]), which belongs to the family of Rasch models.
It converts raw scores into linear and reproducible measurement and its distinguishing
characteristics are: separable person and item parameters, sufficient statistics for
the parameters and conjoint additivity; prerequisites are unidimensionality and local
independence. If the data fit the model, then the obtained measures are objective and
expressed in logits ([7]). Following the RSM, given an item i with m + 1 response
categories ($c = 0,1,\dots,m$), the probability that subject s, with level of latent trait
$\beta_s$, responds in category c is given by:


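$$P(X_{si} = c) = p_{sic} = \frac{\exp\{c(\beta_s - \delta_i) - \sum_{j=0}^{c} \tau_j\}}{\sum_{k=0}^{m} \exp\{k(\beta_s - \delta_i) - \sum_{j=0}^{k} \tau_j\}} \qquad (1)$$

where $\delta_i$ represents the difficulty of item i and $\tau_j$ is called threshold
($\tau_0 = 0$ and $\sum_{j=1}^{m} \tau_j = 0$).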

The idea underlying this paper is to use the information obtained from the application
of equation 1, that is the estimates of the $\beta_s$ and $\tau_j$ parameters, in order to
forecast the answer of the subjects to a representative item. These answers represent
the categories assigned to the subjects in the categorized version of their latent trait.


2 Discretization of a Rasch measure 


In order to discretize a continuous variable so that the obtained categories are
ordinal, common methods are equal-width and equal-frequency discretization.
Once the n observations of the variable x have been sorted, the equal-width discretization
algorithm consists in dividing the range of x, represented by the interval
$[x_1, x_n]$, into a predefined number k of equal-width intervals and assigning
level j to subject i if and only if

$$x_1 + \frac{(j-1)(x_n - x_1)}{k} < x_i \le x_1 + \frac{j(x_n - x_1)}{k}.$$

The equal-frequency discretization algorithm consists in dividing the range of the
variable x according to a user-defined number of intervals, k, delimited by the
$1/k, 2/k, \dots, (k-1)/k$ empirical quantiles Q, so that subject i is assigned to
level j if and only if $Q_{j/k} < x_i \le Q_{(j+1)/k}$.
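
The two standard methods can be sketched as follows. This is a minimal illustration assuming NumPy; the function names are hypothetical, both functions number the levels 1,...,k for convenience, and the minimum observation is placed in the first level.

```python
import numpy as np

def equal_width(x, k):
    """Assign levels 1..k using k intervals of equal width over [min, max]."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), k + 1)
    # Right-closed intervals: level j holds values in (edge_{j-1}, edge_j].
    return np.searchsorted(edges[1:-1], x, side="left") + 1

def equal_frequency(x, k):
    """Assign levels 1..k using the 1/k, ..., (k-1)/k empirical quantiles."""
    x = np.asarray(x, dtype=float)
    cuts = np.quantile(x, np.arange(1, k) / k)
    return np.searchsorted(cuts, x, side="left") + 1

x = [0, 1, 2, 3, 100]            # a skewed toy sample
print(equal_width(x, 2))         # one cut at the midpoint of the range
print(equal_frequency(x, 2))     # one cut at the median
```

On a skewed sample such as this one, the equal-width cut isolates the extreme value, while the equal-frequency cut splits the sample around the median.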

The method proposed in this paper originates from the consideration that the
above methods keep no trace of the way in which the measure has been obtained.
The measure was estimated using joint maximum likelihood estimation under the
constraint that the sum of the difficulty parameters $\delta_i$ is equal to zero. An item
with difficulty equal to zero corresponds to an item with average difficulty, so this
item seems to be a representative one, able to discriminate between the subjects.

Given the estimated thresholds $\{\hat\tau_j\}_{j=1}^{m}$, the estimated measures of the
latent trait $\{\hat\beta_s\}_{s=1}^{n}$ and $\delta_i = 0$, equation 1 allows one to calculate
the response probability record $\{\hat p_{sic}\}_{c=0}^{m}$ for each subject s. The
discretized version of $\hat\beta_s$ for subject s, $b_s$, is represented by the most
probable response category, that is

$$b_s = \arg\max_{c} \hat p_{sic}. \qquad (2)$$
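
A minimal sketch of this rule, assuming NumPy, is given below. The function names are illustrative; the threshold values are those reported for Worker Satisfaction in Table 1 of this paper, and the three person measures are arbitrary values chosen for the example.

```python
import numpy as np

def rsm_probabilities(beta, tau, delta=0.0):
    """Category probabilities p_c, c = 0..m, under the Rating Scale Model.

    beta  : person measure (logits)
    tau   : thresholds tau_1..tau_m (tau_0 = 0 is added internally)
    delta : item difficulty; 0 corresponds to the representative item
    """
    tau = np.concatenate(([0.0], np.asarray(tau, dtype=float)))
    c = np.arange(len(tau))
    # Exponent of equation (1): c * (beta - delta) - sum_{j <= c} tau_j
    expo = c * (beta - delta) - np.cumsum(tau)
    expo -= expo.max()                 # numerical stabilisation only
    p = np.exp(expo)
    return p / p.sum()

def rasch_discretize(betas, tau):
    """Most probable category on the delta = 0 item for each measure."""
    return np.array([int(np.argmax(rsm_probabilities(b, tau))) for b in betas])

tau_hat = [-1.71, -0.94, 0.11, 2.54]   # Worker Satisfaction thresholds
levels = rasch_discretize([-3.0, 0.89, 5.0], tau_hat)
print(levels)
```

Unlike the equal-width and equal-frequency rules, the resulting cut points depend on the estimated thresholds, so they carry information on how the measure was obtained.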


3 Evaluation of the goodness of a discretization method 


In order to verify the goodness of the proposed discretization, the new, discretized
variable has been compared with a global single-item measure. Intuitively, the global
single-item measure can be seen as an approximation of the discretized version of
the latent trait measured by the set of items that are proposed for this purpose; for
example, the global item for job satisfaction is “How satisfied are you with your
job as a whole?” and it is reasonable to assume that the respondent implicitly
makes a synthesis of his/her job satisfaction when answering this question. Moreover,
there is literature that has explored the use of a global single-item as a measure
of latent constructs (see for example [4]); depending on the nature of the construct
operationalized, a global single-item measure is often adequate for the purpose.

In order to evaluate how the discretized measure performs with respect to the
global single-item measure, it is necessary to fix a metric able to quantify the
resemblance between global single-item and discretized measures. Let $O_s$ be the response
to the global single-item given by subject s and let $D_s$ be the discretization of
the estimated measure $\hat\beta_s$ for subject s. A first indicator of resemblance is the
percentage of perfect matches between the $O_s$ and the $D_s$. Let $S_s$ be the indicator of
perfect match for subject s, so that $S_s = 1$ if $O_s = D_s$ and $S_s = 0$ otherwise; the
percentage of perfect matches is given by $(\sum_{s=1}^{n} S_s / n) \times 100$, where n
indicates the sample size. This indicator does not take into account the ordering of the
categories, so two other measures of resemblance can be considered. The first one is the
Mean Absolute Difference between the $O_s$ and the $D_s$, that is
$\sum_{s=1}^{n} |O_s - D_s| / n$, which gives an idea of the mean distance between the two
variables. The second one is the Similarity index proposed by Gower ([5]); it is a
normalized indicator with values near 1 indicating a high degree of similarity.
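
The three indicators can be sketched as below, assuming NumPy and integer category codes. The exact form of the Gower index used in the paper is not stated; the sketch assumes the range-normalised version, 1 minus the Mean Absolute Difference divided by the number of categories minus one, which appears consistent with most rows of Table 2 (for instance, 1 - 0.511/4 is approximately 0.872 for Worker Satisfaction, with five categories). The data in the example are hypothetical.

```python
import numpy as np

def resemblance(observed, discretized, n_categories):
    """Percentage of perfect matches, Mean Absolute Difference and a
    range-normalised Gower-type similarity between two ordinal codings."""
    o = np.asarray(observed)
    d = np.asarray(discretized)
    perfect_match = 100.0 * np.mean(o == d)
    mad = np.mean(np.abs(o - d))
    # Assumption: similarity = 1 - MAD / (number of categories - 1).
    similarity = 1.0 - mad / (n_categories - 1)
    return perfect_match, mad, similarity

# Hypothetical responses O_s and discretized measures D_s on categories 0..4.
O = [0, 1, 2, 3, 4]
D = [0, 1, 2, 4, 2]
pm, mad, sim = resemblance(O, D, n_categories=5)
print(pm, mad, sim)
```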


4 First evidence from real data and discussion


This section reports first evidence that the proposed method performs better
than the two standard ones considered. Five psychological constructs were taken
into account and analyzed. The data regarding worker satisfaction and distributive
fairness come from the Survey on the Italian Social Cooperatives carried out in 2007
([3]). The respondents were paid workers employed in Italian social cooperatives
of type A and B. The burn-out data come from a survey held in 2009 concerning
social workers working in Veneto (Northern Italy) ([2]). The data concerning
avoidant attachment come from a survey carried out between March and June, 2016,
at three nursing homes located in Lombardia (Northern Italy). The respondents were
auxiliary nurses. The data regarding life satisfaction come from the Opinions and
Lifestyle Survey ([6]); the respondents were household members aged 16 and
over living in Great Britain. The data were collected between April and May, 2015.

Table 1 reports summary information about the estimates of the measures of the
latent traits considered and the estimated threshold parameters $\tau_j$.


Table 1 Summary of the information derived from the application of RSM

Latent Trait            Number of   E($\hat\beta$)   Skewness - Kurtosis   Thresholds
                        Subjects    (std)            of $\hat\beta$
Worker Satisfaction     3980         0.89 (1.42)      0.87 - 4.79          -1.71; -0.94; 0.11; 2.54
Distributive Fairness   3666        -0.68 (1.88)      0.13 - 3.90          -2.86; -2.09; -1.18; 2.03; 4.11
Burn-out                 770        -1.55 (1.90)      0.52 - 5.14          -9.93; -0.73; 0.08; 4.57
Avoidant Attachment      107        -1.20 (1.37)     -0.94 - 3.52          -1.24; -0.52; -0.04; 1.80
Life Satisfaction       2042         1.47 (1.43)      0.54 - 4.02          -1.72; -1.30; -0.09; 3.11


Table 2 reports the values of the three measures of resemblance considered. The
discretization method proposed in this paper outperforms the two standard ones with
respect to perfect match as well as similarity between discretized and “observed”
measures.

These first results suggest that the developed method is likely to give a good
discretization of the underlying latent variable measured by a Rasch model. Further
investigation is needed, exploring new datasets as well as other discretization
methods, to confirm what is shown in this paper.

Table 2 Measures of resemblance between the global single-item and discretized measures

Latent Trait /                      % Perfect   Mean Absolute   Similarity
Discretization Method               Match       Difference      Index
Worker Satisfaction
  Rasch Discretization              56.4        0.511           0.872
  Equal-width Discretization        20.4        1.005           0.749
  Equal-frequency Discretization    28.5        1.199           0.700
Distributive Fairness
  Rasch Discretization              60.5        0.452           0.910
  Equal-width Discretization        44.1        0.622           0.876
  Equal-frequency Discretization    37.1        0.783           0.843
Burn-out
  Rasch Discretization              60.6        0.440           0.890
  Equal-width Discretization        53.1        0.495           0.876
  Equal-frequency Discretization    44.9        0.697           0.826
Avoidant Attachment
  Rasch Discretization              56.1        0.579           0.807
  Equal-width Discretization        16.8        1.383           0.654
  Equal-frequency Discretization    29.0        1.009           0.748
Life Satisfaction
  Rasch Discretization              70.5        0.315           0.921
  Equal-width Discretization        24.0        0.883           0.779
  Equal-frequency Discretization    26.4        1.207           0.698


Tree-based Non-linear Graphical Models
Modelli Grafici non lineari basati su alberi 


Anna Gottard 


Abstract Graphical models are statistical models that are associated with graphs
whose nodes represent variables of interest. The absence of an edge between two
nodes corresponds to a conditional independence between the variables. In this
work, I propose a class of graphical models for non-linear systems, where the shape
of dependence is modelled by a Bayesian additive regression tree model. The proposed
models are able to detect nonparametrically both non-linearities and interactions
and are suitable for high dimensional data.

Abstract / modelli grafici sono modelli statistici associati a grafi i cui nodi rapp- 
resentano le variabili di interesse. L'assenza di connessione tra due nodi implica 
una indipendenza condizionata tra le rispettive variabili. In questo lavoro, pro- 
pongo una classe di modelli garfici per sistemi non lineari in cui la forma della 
dimendenza è dettata da un modello bayesiano additivo di aleberi di regressione. 
Questi modelli consentono di cogliere in modo non parametrico sia non linearità 
che interazioni tra le variabili e sono adeguati anche per dati a grandi dimensioni. 


Key words: Graphical models, graph learning, non-linear systems, Bayesian addi- 
tive regression trees. 


1 Introduction 


Graphical models (see [9] and [5], among others) have been widely utilised in many 
domains to study the conditional independence structure of a set of random vari- 
ables. Random variables could concern, for instance, expression values of genes, 
metabolites, personal opinions on specific topics and so on. 

In the graph, random variables are represented as nodes and conditional independence
as missing edges. Graphs are characterised by the type of their edges as
undirected, directed and mixed. Moreover, they are characterised by the type of
variables, as continuous, discrete and mixed.

Most of the literature on learning the structure of the graph from data has focused on
concentration graph models, i.e. undirected graph models with random variables
following a multivariate normal distribution. This assumption implies that the
relationship among variables is linear, and learning the graph amounts to assessing which
partial correlation coefficients are zero.

The challenge of learning graphs when the relationship among variables is not
linear has been tackled recently by many authors. In particular, [11, 12] propose
graphical models for Nonparanormal random variables, based on a semiparametric Gaussian
copula. In the case of directed acyclic graphs, [15] propose the use of non-linear
structural equation models. Further interesting work connects graphs with regression
trees (see [13] and [6]). Tree-based models are very attractive as they can model both
non-linearities and interactions, but special care has to be taken with the algorithm
used to detect variable importance. As a matter of fact, a greedy search can sometimes
lead to misleading results. A troublesome situation arises, for instance, when the
true graph is the one presented in Figure 1, for particular values of the parameters.

In this work, I consider the problem of Bayesian estimation and learning of an
undirected graph for continuous or mixed variables in the presence of non-linearities
and interactions. Like [13] and [6], my proposal is based on a tree ensemble model,
namely Bayesian Additive Regression Trees (BART) models [4, 2]. The model is
nonparametric and specifies the expected value as a sum of trees. The Bayesian Backfitting
Markov Chain Monte Carlo procedure [7] used for estimation avoids the need for
greedy search algorithms. This approach provides full posterior inference, including
credible intervals for model parameters. The structure learning of the undirected
graph adjacency matrix is achieved by node-wise regression and the local Markov
property. The graph learned according to this procedure is sometimes called a
dependency network [8].


2 The proposal 


Consider the collection of random variables $X_v$, $v \in V = \{1,\dots,p\}$, whose
conditional independence structure is depicted in a graph $\mathcal{G} = (V, E)$. The set V
collects the nodes of the graph and the set E collects the edges connecting the nodes.

The graph is ruled by a set of Markov properties. The local Markov property
for undirected graphs states that each variable $X_v$ in the graph is conditionally
independent of all the other nodes given its neighbours. The set of neighbours of
a node $X_v$, denoted by $ne(v)$, gathers all the variables whose nodes have an edge
connecting them to v. The set containing v and its neighbours is called the closure and
denoted by $cl(v)$. Then, the local Markov property for undirected graphs can be
written as

$$X_v \perp\!\!\!\perp X_{V \setminus cl(v)} \mid X_{ne(v)} \qquad \forall v \in V.$$

Fig. 1 Example of undirected graph with five nodes.


In the graph in Figure 1, for instance, according to the local Markov property
$X_2 \perp\!\!\!\perp (X_3, X_4) \mid (X_1, X_5)$, as the neighbours of $X_2$ are $X_1$ and
$X_5$, and $X_3$, $X_4$ are the only nodes not in the closure of $X_2$.

Learning the conditional independence structure of an undirected graphical
model consists of learning the p by p adjacency matrix $\mathcal{A}$, with entry $a_{ij} = 1$
when $X_i$ and $X_j$ are neighbours and $a_{ij} = 0$ otherwise. A node-wise graph selection
procedure consists of recovering the adjacency matrix row by row. Hence, learning
the structure of the graph involves p separate steps, each one concerning the
conditional distribution of a single variable given the others. The idea behind
neighbourhood selection goes back to [1]; see also [14], among others. When one
can link the parameters of each conditional distribution with some of those of the
other conditional distributions, a better approach is based on the pseudo-likelihood
function; see, for example, [10].
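
The node-wise procedure can be sketched as follows. This is not the BART-based selection proposed in this paper: as a plainly labelled stand-in, each node is regressed on the others by ordinary least squares on standardised columns, and a predictor is kept when its absolute coefficient exceeds a hypothetical threshold. The function name, the threshold and the simulated data are all assumptions made for illustration; only NumPy is required.

```python
import numpy as np

def nodewise_adjacency(X, threshold=0.2, rule="and"):
    """Recover an adjacency matrix row by row via node-wise regression.

    Stand-in selector: least squares on standardised columns, keeping
    predictors whose |coefficient| exceeds `threshold`.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    p = Z.shape[1]
    selected = np.zeros((p, p), dtype=bool)
    for v in range(p):
        others = [j for j in range(p) if j != v]
        coef, *_ = np.linalg.lstsq(Z[:, others], Z[:, v], rcond=None)
        selected[v, others] = np.abs(coef) > threshold
    # The p separate fits need not agree, so the selections are symmetrised.
    adj = selected & selected.T if rule == "and" else selected | selected.T
    return adj.astype(int)

# Simulated example: X1 and X2 are dependent, X3 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = 0.8 * x1 + rng.normal(size=2000)
x3 = rng.normal(size=2000)
A = nodewise_adjacency(np.column_stack([x1, x2, x3]))
print(A)
```

Replacing the least-squares step with a variable-importance measure from a tree ensemble leaves the row-by-row structure of the loop unchanged.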

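Now, suppose that the collection of random variables $X_V = \{X_1,\dots,X_p\}$ follows
an unknown joint distribution $F_V(X; \theta)$, with finite first moment. Assume that
dependence occurs through the conditional expectation, meaning that
$X_v \perp\!\!\!\perp X_j \mid X_{-v,-j}$ iff

$$F(X_v \mid X_{-v}) = F(X_v \mid X_{-v,-j}) \iff E[X_v \mid X_{-v}] = E[X_v \mid X_{-v,-j}],$$

with $X_{-v} = X_V \setminus \{X_v\}$ and $X_{-v,-j} = X_V \setminus \{X_v, X_j\}$. Consequently

$$X_v \perp\!\!\!\perp X_{V \setminus cl(v)} \mid X_{ne(v)} \iff E[X_v \mid X_{V \setminus cl(v)}, X_{ne(v)}] = E[X_v \mid X_{ne(v)}].$$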


For each node $v \in V$, the dependence structure of $X_v$ on its neighbours is
described by a Bayesian Additive Regression Trees (BART) model [4], as follows

$$X_v = \sum_{j=1}^{m} T_j^v(X_{-v}; \mathcal{T}_j^v, M_j^v) + \varepsilon_v, \qquad (1)$$

where $\varepsilon_v$ has a certain distribution, for instance $N(0, \sigma_v^2)$. Here
$\mathcal{T}_j^v$ is the structure of the j-th tree. Moreover, $M_j^v$ is the subset of
$\Theta^v$ containing the parameters for the tree $\mathcal{T}_j^v$, i.e. the mean vector over
its terminal nodes, when regressing $X_v$. In the following, the suffix v will be omitted
when not necessary.

Fig. 2 Example of a regression tree with two variables and the corresponding partition.

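Let $\mathcal{X}_{-v} = \mathcal{X}_1 \times \dots \times \mathcal{X}_{v-1} \times \mathcal{X}_{v+1} \times \dots \times \mathcal{X}_p$
be the product state space of the variables in $X_{-v}$ and $\mathcal{R}$ the space of all
binary partitions R of $\mathcal{X}_{-v}$. A partition is a binary partition if it can be
obtained by sequentially dividing $\mathcal{X}_{-v}$ into two parts by splitting only one
component $\mathcal{X}_j$, $j \in V \setminus v$. Each structure $\mathcal{T}_j$ detects exactly
one partition, say $R^j = \{R^j_1, \dots, R^j_{d_j}\} \in \mathcal{R}$. Then $M_j$ has length
$d_j$. The tree $T_j$ in (1) can be written as

$$T_j(X_{-v}; \mathcal{T}_j, M_j) = \sum_{n=1}^{d_j} \mu_{jn}\, I\{x \in R^j_n\}. \qquad (2)$$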

Figure 2 provides an example of tree and its corresponding partition. 
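
Equation (2) says that a tree is a step function over a rectangular partition. A minimal sketch of this reading, assuming NumPy, is given below; the partition (a split on the first predictor at 0.5, then on the second at 0.3) and the terminal-node means are hypothetical and are not those of Figure 2.

```python
import numpy as np

def tree_value(x, regions, mu):
    """Evaluate sum_n mu_n * I{x in R_n} for one tree.

    regions : list of (lower, upper) bound arrays, R_n = {x : lower < x <= upper}
    mu      : terminal-node means, one per region
    """
    x = np.asarray(x, dtype=float)
    indicators = np.array([np.all((lo < x) & (x <= hi)) for lo, hi in regions])
    return float(np.dot(mu, indicators))

def sum_of_trees(x, trees):
    """Sum-of-trees mean: add the step functions of all trees."""
    return sum(tree_value(x, regions, mu) for regions, mu in trees)

inf = np.inf
# Hypothetical binary partition of a two-dimensional predictor space.
regions = [
    (np.array([-inf, -inf]), np.array([0.5, inf])),   # x1 <= 0.5
    (np.array([0.5, -inf]), np.array([inf, 0.3])),    # x1 > 0.5, x2 <= 0.3
    (np.array([0.5, 0.3]), np.array([inf, inf])),     # x1 > 0.5, x2 > 0.3
]
mu = np.array([1.0, -2.0, 4.0])

print(tree_value([0.2, 0.9], regions, mu))
print(sum_of_trees([0.7, 0.9], [(regions, mu), (regions, mu)]))
```

Because the regions form a partition, exactly one indicator is equal to one for any point, so each tree contributes a single terminal-node mean.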

The Bayesian Backfitting Markov Chain Monte Carlo algorithm can be implemented to
draw samples of $\mathcal{T}_j$ as suggested in [4]. This avoids the greedy search of the
partition. Regularization prior distributions can be used for high dimensional settings.
The importance of the variables can be computed via permutation inference as suggested
by [2]. The inclusion of an additional linear component in (1) sets these models in
the class of quasi-linear systems [16]

$$X_v = \beta^v X_{-v} + \sum_{j=1}^{m} T_j^v(X_{-v}; \mathcal{T}_j^v, M_j^v) + \varepsilon_v. \qquad (3)$$


The Bayesian Backfitting Markov Chain Monte Carlo algorithm can also be implemented in
this case. By adding a linear component, the depth of the trees can be reduced,
diminishing the risk of overfitting.




3 Conclusions 


For many high dimensional problems, the assumption of joint Gaussianity to study
the association structure of the variables of interest may be inadequate. An
interesting aspect of the approach I am proposing is that it can handle non-linearities,
interactions and also the case of mixed-type data. BART models are a tree-based
method able to produce proper inference and credible intervals. This aspect makes
them more attractive than ordinary tree-based models and random forests.


Sentiment Analysis for micro-blogging
using LSTM Recurrent Neural Networks


Sentiment Analysis per il micro-blogging 
utilizzando LSTM Recurrent Neural Networks 


Sara Hbali, Youssef Hbali, Mohamed Sadgal and Abdelaziz El Fazziki 


Abstract In this paper we study a novel method that considers multiple sources of
information: not only textual properties, but also visual entries. By combining
this information, the proposed method offers more information to feed the
model for sentiment analysis. Most papers have used only textual properties
for sentiment analysis, whilst our contribution is to add the visual
property to achieve better movie review classification.

Abstract In questo documento, studiamo una nuova metodologia che prende 
in considerazione diversi tipologie di informazioni. Non solamente proprietà 
testuali, ma anche input visivi. Combinando queste informazioni, tale metodo 
offre un maggior numero d’informazioni per alimentare il modello per l’analisi 
del sentimento. La maggior parte degli studi precedenti usano solo proprietà 
di tipo testuale per l’analisi sentimentale, mentre il nostro contributo con- 
siste nell’aggiungere proprietà visive in modo tale da ottenere una migliore 
classificazione di recensione del film. 


Key words: Sentimental analysis, features extraction, LSTM, CNN 


1 Introduction 


The principle of micro-blogging is based on users' opinions, feedback and
experiences about any chosen topic. The difference between regular blogs and
micro-blogs is the content size; for instance, Twitter is one of the most popular
social networking and communication platforms used to exchange and express
ideas. However, one tweet is only 140 characters long, with the possibility of adding
an image or video. Previous works concentrate only on the textual content,
which leaves the visual information unexploited, although it could offer
more detailed information to use in sentiment analysis.

Sentiment analysis, also known as opinion mining, is the process of
detecting whether the characteristics of tweets tend towards neutral, positive
or negative. This research area existed before Twitter and micro-blogging, since
it can be applied to different domains: [9] aims to retrieve little-known
but relevant communities, [8] focuses on public opinion and news articles
applied to the stock markets, and [7] focuses on tweets to evaluate the impact
on votes for political candidates.

In order to demonstrate the effectiveness of the method including
visual content, we will apply our approach to movie review classification.


2 General approach 


The proposed model aims to process a large set of movie reviews for classification;
we therefore consider each movie review as an entry consisting of a variable-length
sequence of words, and the sentiment of each movie review must be classified.


[Figure 1: diagram with two branches. Visual content: feature extraction, then movie
recognition. Textual content: word embedding layer, then classification of the
sentiment. Both branches lead to the movie feedback.]

Fig. 1 Complete model for movie classification.


Movie review classification 


In this part of the paper we use the model proposed in [5, 3, 4, 1, 2] to
implement a standard LSTM model with a variant modification. The equations below
describe how a layer of memory cells is updated at every timestep t. In these
equations:

$x_t$ is the input to the memory cell layer at time t.
$W_i$, $W_f$, $W_c$, $W_o$, $U_i$, $U_f$, $U_c$, $U_o$ and $V_o$ are weight matrices.
$b_i$, $b_f$, $b_c$ and $b_o$ are bias vectors.

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (1)$$
$$\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (2)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (3)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (4)$$


Fig. 2 Illustration of the model used in this paper. It is composed of a single LSTM layer followed by mean pooling over time and logistic regression.


Equations (1), (2), (3) and (4) are performed in parallel to make the computation more efficient. This is possible because none of these equations relies on a result produced by the others. It is achieved by concatenating the four matrices $W_*$ into a single weight matrix $W$, and performing the same concatenation on the weight matrices $U_*$ to produce the matrix $U$ and on the bias vectors $b_*$ to produce the vector $b$. Then, the pre-nonlinearity activations can be computed with:

$$z = W x_t + U h_{t-1} + b \qquad (5)$$

The result is then sliced to obtain the pre-nonlinearity activations for $i$, $f$, $\tilde{C}_t$ and $o$, and the non-linearities are then applied independently to each.
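
The concatenation trick can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the paper: the block ordering inside $W$, $U$ and $b$, the helper names (`matvec`, `lstm_step`, `mean_pool`) and the toy weights are assumptions made for illustration, and the cell and hidden state updates are the standard LSTM ones, which the text above does not spell out.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lstm_step(W, U, b, x_t, h_prev, c_prev):
    """One LSTM step with the concatenated parameters of equation (5).

    W (4n x d), U (4n x n) and b (4n) stack the blocks for i, C~, f, o,
    in that order (an ordering assumed for this sketch).
    """
    n = len(h_prev)
    # Equation (5): all four pre-nonlinearity activations at once.
    z = [wx + uh + bk
         for wx, uh, bk in zip(matvec(W, x_t), matvec(U, h_prev), b)]
    # Slice z and apply each non-linearity independently.
    i = [sigmoid(v) for v in z[0:n]]                # equation (1)
    c_tilde = [math.tanh(v) for v in z[n:2 * n]]    # equation (2)
    f = [sigmoid(v) for v in z[2 * n:3 * n]]        # equation (3)
    o = [sigmoid(v) for v in z[3 * n:4 * n]]        # equation (4)
    # Assumed standard LSTM state update (not shown in the text).
    c = [ik * ck + fk * pk for ik, ck, fk, pk in zip(i, c_tilde, f, c_prev)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def mean_pool(hidden_states):
    """Average the hidden states over time, as in the model of Fig. 2."""
    T = len(hidden_states)
    size = len(hidden_states[0])
    return [sum(h[k] for h in hidden_states) / T for k in range(size)]


# Toy example with n = 2 hidden units and d = 3 input features.
n, d = 2, 3
W = [[0.1 * (r + c) for c in range(d)] for r in range(4 * n)]
U = [[0.05 * (r - c) for c in range(n)] for r in range(4 * n)]
b = [0.0] * (4 * n)
h, c = [0.0] * n, [0.0] * n
states = []
for x_t in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h, c = lstm_step(W, U, b, x_t, h, c)
    states.append(h)
pooled = mean_pool(states)  # would feed the logistic regression layer
print(pooled)
```

A single matrix-vector product per step replaces four smaller ones, which is where the efficiency gain comes from when the products are executed by an optimized linear algebra backend.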


Image processing 


The entry step consists of analyzing the attached image to identify the movie category. In this step we extract features using a CNN layer as explained in .. When we train a deep neural network in Caffe [6] to classify images, we specify a multilayered neural network with different types of layers, such as convolution, softmax loss and so on. The last layer is the output layer, which gives us the output tag with the corresponding confidence value. In this model we used various layers to obtain the feature vectors of the input image.


Fig. 3 Illustration of the CNN model used in this paper (training phase: labels and machine learning; predictive phase).


To match our training data against the input image, we use the dot product of two feature vectors:

$$\mathrm{dot}(a, b)[i, j, k, m] = \sum_{l} a[i, j, l] \, b[k, l, m] \qquad (6)$$

The result of the dot product (6) of $a$ and $b$ is equivalent to the matrix multiplication of the feature vectors of the training phase and of the predictive phase.
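
As a rough illustration of the matching step, the sketch below compares the feature vector of an input image with stored training feature vectors and returns the label with the largest dot product. The labels, the three-dimensional vectors and the function names are hypothetical, chosen only to make the example self-contained.

```python
def dot(a, b):
    """Dot product of two feature vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def best_match(query, training_features):
    """Return the label whose training vector has the largest dot product."""
    return max(training_features,
               key=lambda label: dot(query, training_features[label]))


# Hypothetical feature vectors from the training phase, one per label.
training_features = {
    "action": [0.9, 0.1, 0.0],
    "comedy": [0.1, 0.8, 0.1],
    "drama": [0.0, 0.2, 0.9],
}
# Hypothetical feature vector of the input image (predictive phase).
query = [0.2, 0.7, 0.1]
print(best_match(query, training_features))
```

Since the raw dot product favours vectors with a large norm, a practical variant would probably normalize the feature vectors first.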


2.0.1 Experiment 

In this section we first introduce the dataset used in our experiment, then the implementation details, and finally discuss the results.

2.0.2 Dataset 

Deep learning architectures usually need a large training dataset. For this purpose, we used the IMDB dataset for movie reviews, together with images collected from the IMDB website. The dataset contains more than 25 000 polar movie reviews for training and 25 000 for testing, and more than 400 000 images (celebrity profiles, scenes from movies, ..) used in [11, 10] for apparent age estimation from single images.


2.0.3 Implementation details 


In order to demonstrate the effectiveness of the method, we apply pre-trained models to our example for both textual and visual content. The entry data for our model is the image: from it we can predict several pieces of information, as shown in Figure 4, such as the category of the movie or the actors, which can give us a hint about the textual content.


Fig. 4 Extracted labels using visual content (movie title, actor name, scene).


Table 1 Results achieved by the LSTM model

                Phase 1   Phase 2   Phase 3
Loss            0.5570    0.3530    0.2559
Accuracy        0.7149    0.8577    0.9019
Training time   107s      107s      107s


Training our model on the IMDB dataset provides a highly discriminative sentiment analysis, as shown in Table 1. We also challenged our model with complementary information: the image recognition model achieves a detection rate of 89% using the features learned from a convolutional neural network model.


2.0.4 Conclusion 


We propose an optimization of the sentiment analysis problem for IMDB movie reviews, such that each review is labeled with either positive or negative sentiment. We then combine review sentiment analysis with image analysis to give a complete overview of the requested content.

One of the weaknesses of the proposed model is that it requires time and memory to produce effective results.

For future work, we can try various deep learning models for classification and explore other information such as location or video.


How to Exploit Big Data from Social
Networks: a Subjective Well-being 
Indicator via Twitter 


Stefano Maria Iacus, Giuseppe Porro, Silvia Salini and Elena Siletti 


Abstract In our research we apply a new technique of opinion analysis to Twitter data to propose a new indicator of perceived and subjective well-being: the SWBI, which examines many dimensions of individual and social life. In order to investigate whether the SWBI and its single components adequately represent the reaction of a community to changes in everyday life conditions, we propose a comparative analysis, among the Italian provinces, of perceived well-being, measured with the SWBI, and objective well-being, measured with the Il Sole 24 Ore QoL Index. The idea is to create a composite well-being indicator which mixes stable official statistics and fluctuating social media data.



Key words: well-being, social indicators, big data, social networks, sentiment analysis


Stefano Maria Iacus, Silvia Salini, Elena Siletti 

Department of Economics, Management and Quantitative Methods, University of 
Milan 

e-mail: stefano.iacus, silvia.salini, elena.siletti@unimi.it 


Giuseppe Porro 
Department of Law, Economics and Culture, University of Insubria 
e-mail: giuseppe.porro@uninsubria.it


Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 




1 Introduction: Theoretical Frameworks 


In the last decades scholars have become increasingly interested in new measures of quality of life. A milestone was reached in 2009, when the so-called Stiglitz Commission proposed building a system of objective and subjective indicators, which strongly influenced further studies: different indicators, with different structures, considering a great variety of dimensions and serving many purposes, are now considered. For subjective indicators, self-reports have been used extensively, although they are often misleading (9) and, despite the efforts made, much uncertainty remains in using them (6). They have two main limitations: the influence that a single question can have, and the limited frequency of the surveys, which may fail to capture trend changes and to distinguish between the short-run “emotional” component and the structural one (“life evaluation” or “life satisfaction”).

Social networks offer a new, rich source of information, which is available without any survey: one simply has to listen. They host an open, enormous amount of data that allows social dynamics to be studied from an unseen perspective. Analysing them means listening to what people say: for well-being, this means being able to measure feelings in real time and to map their fluctuations (5). In recent years researchers have used these data for a wide range of applications, including monitoring influenza and other health outbreaks, predicting the stock market, and understanding sentiment about products or people. There is a wide set of works aiming at tracking happiness through Twitter; for the Italian provinces, (5) propose the iHappy index, which is measured with an innovative statistical technique on millions of tweets.

Social media data make it possible to collect longitudinal data and to measure phenomena more frequently. Skeptics have questioned whether enthusiasts’ claims are overly optimistic (4), and whether any form of non-probability sampling, such as this new kind of analysis, is too risky (1). Others have noted that media data may introduce new kinds of bias (2), which raises the question of whether they are sufficiently reliable. We need to understand and solve these new challenges: we cannot ignore this new and rich source of information. While big data are unlikely to replace high-quality surveys, they could be useful when such surveys are not available. The two methods can serve complementary functions.

Sentiment analysis is the core aspect: despite many limitations (4), if correctly performed it seems to be a useful framework to exploit when the constraints of standard survey methodology may be too strong (8). On the one hand, there are no questions to pose: all the analyst has to do is listen and classify the opinions expressed accordingly. On the other hand, the available information is updated in real time, so the frequency can be as high as desired, which allows the volatile/emotional component to be separated from the permanent/structural one.

With the SWBI (Social Well-being Index) we make a new proposal, relying on Twitter data and on one of the most recent techniques for sentiment analysis. This approach disentangles the main methodological issues raised in the literature on well-being measurement, and produces a set of indicators that span the wide range of well-being perceptions.


2 The SWBI 


The SWBI is a multidimensional indicator derived from a new human-supervised technique (iSA, Integrated Sentiment Analysis (3)) designed to capture several aspects of well-being. In the iSA algorithm the human part is essential, because it allows information to be retrieved from texts without relying on dictionaries or special semantic rules. Humans simply read a text and associate a topic (D = “satisfied at work”) with it. Then, the computer learns the association between the whole set of words used in a text to express that opinion and extends the same rule.

Formally, let us denote by $\mathcal{D} = \{D_0, D_1, \ldots, D_M\}$ the set of possible categories (i.e. opinions). The target of interest is $\{P(D), D \in \mathcal{D}\}$, i.e. the distribution of opinions in a corpus of $N$ texts. $D_0$ refers to off-topic or not relevant texts (i.e. noise). Let $S_i$, $i = 1, \ldots, K$, be a vector of $L$ possible stems which identifies one of the texts in a corpus. More than one text in the corpus can be represented by the same $S_i$, and each element of $S_i$ is equal to 1 if that stem is contained in a text, and 0 otherwise. The formalized data set is $\{(s_j, d_j), j = 1, \ldots, N\}$, where $s_j \in \mathcal{S}$ (the space of possible vectors $S_i$) and $d_j$ can either be “NA” or one of the hand-coded categories $D \in \mathcal{D}$.

The “traditional” approach, which includes machine learning methods and statistical models, predicts the outcome $\hat{d}_j = D$ for the texts with $S = s_j$ belonging to the test set; once all data have been imputed, the estimated categories $\hat{d}_j$ are aggregated to obtain an estimate of $P(D)$. We can write

$$\underset{M \times 1}{P(D)} = \underset{M \times K}{P(D|S)} \; \underset{K \times 1}{P(S)} \qquad (1)$$


where $P(D|S)$ is an $M \times K$ matrix of conditional probabilities, and $P(S)$ is a vector with the distribution of $S_i$ over the corpus. Once $P(D|S)$ is estimated from the training set with, say, $\hat{P}(D|S)$, then for each document in the test set with stem vector $s_j$, the opinion $d_j$ is estimated with the simple Bayes estimator as the maximizer of the conditional probability, i.e. $\hat{d}_j = \arg\max_{D \in \mathcal{D}} \hat{P}(D|S = s_j)$. This approach does not work if $P(D_0)$ is very large compared to the rest of the $D_i$’s. iSA follows the idea of (7) of changing the point of view, but goes one step further in terms of computational efficiency and variance reduction. Instead of (1), one can consider this new equation

$$\underset{K \times 1}{P(S)} = \underset{K \times M}{P(S|D)} \; \underset{M \times 1}{P(D)} \qquad (2)$$

where now $P(S|D)$ is a matrix whose elements $P(S = S_k | D = D_i)$ represent the frequency of a particular stem vector $S_k$ given the set of texts which actually express the opinion $D = D_i$. The solution of the problem is

$$\text{(inverse problem)} \quad \underset{M \times 1}{P(D)} = \underset{M \times M}{[P(S|D)^T P(S|D)]^{-1}} \; \underset{M \times K}{P(S|D)^T} \; \underset{K \times 1}{P(S)} \qquad (3)$$


Equation (3) yields a direct estimate of the distribution of opinions $P(D)$, but individual classification is no longer possible. In fact, this is not a limitation, as the accuracy of (3) is vastly better than that of (1). Moreover, researchers are understandably more interested in the aggregate distribution of opinions than in the estimation of individual opinions (3).
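
To make the inverse problem concrete, the sketch below solves equation (3) on a toy example with $K = 3$ stem vectors and $M = 2$ opinions. Equation (3) is the least-squares solution of (2), obtained by premultiplying both sides of (2) by $P(S|D)^T$ and inverting the resulting $M \times M$ matrix. This is only an illustration of the algebra and not the iSA implementation: the numbers, the helper functions and the use of Gaussian elimination are assumptions made for the example.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, y):
    """Solve the square system A x = y (Gaussian elimination, pivoting)."""
    n = len(A)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def estimate_opinion_distribution(P_S_given_D, P_S):
    """Equation (3): P(D) = [P(S|D)^T P(S|D)]^{-1} P(S|D)^T P(S)."""
    A = P_S_given_D                       # K x M
    At = transpose(A)                     # M x K
    AtA = matmul(At, A)                   # M x M
    Atb = [sum(a * s for a, s in zip(row, P_S)) for row in At]
    return solve(AtA, Atb)


# Hypothetical P(S|D): each column is a distribution over the K stem vectors.
P_S_given_D = [[0.7, 0.1],
               [0.2, 0.3],
               [0.1, 0.6]]
# P(S) generated from a "true" P(D) = (0.25, 0.75) through equation (2).
P_S = [0.25, 0.275, 0.475]
P_D = estimate_opinion_distribution(P_S_given_D, P_S)
print(P_D)
```

In this noiseless example the aggregate distribution is recovered without classifying any single text, which is the point made about equation (3).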

To define the SWBI, we were inspired by the NEF (New Economics Foundation) and its Happy Planet Index. The SWBI has eight dimensions concerning three different well-being areas. Each component is defined through the hypothetical question one might find in a survey; here, however, no questions are asked: the sentiment is extracted from the text. The components are: Personal well-being: emotional well-being (emo), satisfying life (sat), vitality (vit), resilience and self-esteem (res), positive functioning (fun); Social well-being: trust and belonging (tru), relationships (rel); Well-being at work: quality of job (wor).

Each tweet has been classified according to the scale -1, 0, 1, where -1 is 
for negative, 0 is neutral and 1 is positive feeling. To enhance the action of 
human supervision, additional rules have been introduced: 


• Each tweet can be classified along one or more dimensions;

• Only self-expressed or individual expressions of well-being, or the tweeter’s own views, are considered;

• Re-tweets are considered, because the tweeters share the same view;

• Off-topic texts are marked appropriately;

• If the encoders are not fully convinced about the semantic context, they do not classify the text: they skip it and classify another one.


Our data source consists of tweets written in Italian and posted from Italy, accessed through Twitter’s public API. Each day, around 1 to 5% of the tweets contain geo-reference information, which allows indicators to be built at the province level. Since February 2012 we have stored and analysed more than 180 million tweets.


3 The SWBI and the Il Sole 24 Ore QoL Index in the 
Italian Provinces 


Since 1990, the Italian business newspaper Il Sole 24 Ore has published an index of the quality of life (QoL) for all the Italian provinces. Since 2016, the composite indicator has had six components, based on a simple arithmetic mean of 42 normalized indicators. To compare its components with the SWBI, we rescaled them from 0 to 100. The components are: I1-Income, Savings, Consumption; I2-Environment, Services, Welfare; I3-Business, Work, Innovation; I4-Justice, Security, Crime; I5-Demographics, Family, Integration; I6-Culture, Leisure, Participation. As one can see, the Il Sole 24 Ore QoL index covers only material quality of life and, for this reason, has become a benchmark indicator for objective well-being. Despite efforts to improve its quality, the index, in addition to having a low frequency with only annual data, often shows delayed information. This is a serious flaw when decision-makers want to base their choices on such information. As we noticed, the SWBI has the twofold advantage of being a high-frequency instrument and of being updatable almost in real time. On the other hand, the SWBI is an index of subjective well-being, and the differences between the two dimensions (objective and subjective) clearly emerge from the comparison of the two indicators.

Fig. 1 All the panels refer to 2016: the original index is shown in red shades, the ranking of the Italian provinces in blue shades.

As an example, we compare the SWBI component on well-being at work (wor) with the I3 (Business, Work and Innovation) component of the Il Sole 24 Ore QoL index, where the quality of work and of the labour market is evaluated by objective quantities (total employment rate, exports as % of GDP, number of innovative start-ups per 1000 enterprises, number of registered enterprises per 100 inhabitants, loans-to-deposits ratio, patent applications per 1000 inhabitants, youth unemployment rate at 15-24 years). Clearly the information conveyed by the two indicators is not the same. First of all, I3 (see Fig. 1, left panels) shows a strong polarization: Northern and Central Italy have I3 values significantly higher than the Southern provinces; (wor), on the other hand, is more stable across provinces and does not show appreciable concentration phenomena. The evidence is confirmed by the ranking of provinces according to (wor) and I3 values, respectively (see Fig. 1, right panels).

Fig. 2 SWBI and Il Sole 24 Ore index components in Milan; the red lines mark the I3 and wor components, respectively.

Moreover, even if we smooth out the volatility of (wor) due to its high frequency and compare the annual average values of (wor) and I3, different trends emerge. Let us examine, for example, the indicators for the city of Milan since 2013 (see Fig. 2): while I3 shows a slightly increasing trend, (wor) exhibits a remarkable increase starting from 2015, and the same behaviour is shown by almost all the SWBI components since 2016. It may be that the feeling of a recovery in economic conditions and an improved confidence in the personal and collective future have an impact on perceived well-being even beyond what can be observed in current, traditional and objective economic indicators.


Network Analysis of Comorbidity Patterns in
Heart Failure Patients using Administrative 
Data 




Francesca Ieva 


Abstract In this work, we investigate the pattern of comorbidities in patients affected by Heart Failure (HF) through network analysis. Specifically, the pathologies are taken as the nodes of the network, while the links represent connections between two of them (a link is present if a patient is affected by both pathologies in his/her last hospitalization). Thus, we study the comorbidity pattern of HF patients hospitalized in Lombardy between 2006 and 2012 using administrative data. We also applied community detection techniques in order to detect groups of diseases which are more strongly connected. The application of network analysis to such data enables a new perspective in the study of heart failure.



Key words: Heart Failure, Comorbidities, Administrative Data, Network Analysis, Community Detection


Introduction


Congestive heart failure (CHF) is a disease that occurs when the heart muscle cannot properly pump blood into the vessels. The term congestive is used because a common symptom of HF is congestion, i.e., the retention of too much fluid in tissues and veins. The most common symptoms of HF mainly affect the lungs and the respiratory system in general, as well as circulation. HF can also cause liver enlargement and coagulopathy. From a clinical and social perspective, HF is one of the main public health issues and it still carries substantial morbidity and mortality, with a 5-year mortality that rivals that of many cancers. Moreover, the prevalence of HF in the world seems to show an increasing trend, due mainly to the increase in life expectancy. In addition, HF is often accompanied by comorbidities, and this contributes to the worsening of the quality of life of HF patients.

On the other hand, the importance of complex networks in mathematical modeling has been growing very fast in recent years [1]. This is due to their flexibility and to the increasingly widespread presence of relational data, collected nowadays in contexts such as social networks, interacting dynamical systems, social media, web pages, spreading processes, transportation systems, biological interactions and many others.

The aim of this work is to adopt a network approach in the study of the comorbidities recorded in HF patients. Specifically, we wish to investigate relationships among the morbidities accompanying HF. Moreover, we pursue this goal for the first time in the literature using administrative data in Italy. According to this approach, the perspective is shifted from the patients to their diseases. This allows us to study the pathologies taking advantage of network analysis techniques. The administrative data supporting the study concern HF hospitalizations that occurred in Lombardy (the most populated Italian region) between 2006 and 2012. Administrative data have several advantages: they are population based and they are updated continuously over time. There are also some drawbacks: in particular, they are not collected for epidemiological or statistical purposes. Moreover, the coding criteria for pathologies may vary over time, and this complicates the merging of different administrative databases.


1 Data 


1.1 Heart failure (HF) data 


Originally, we dealt with a large dataset made up of 503,247 hospitalizations regarding 142,587 different patients hospitalized in Lombardy between 2006 and 2012. In particular, all the hospitalizations concerning congestive heart failure (which from now on we simply call HF) are considered.

For each record (i.e., hospitalization) 76 variables are available. They can be divided into four categories or blocks:


1. patient personal information (e.g., his/her regional code, total number of
hospitalizations, sex and age)

2. details of the hospitalizations (e.g., all the surgical procedures and the comor- 
bidities observed for that patient) 

3. pharmacological treatments (e.g., the drugs taken by the patient after his/her stay 
in hospital) 

4. ambulatory services the patients made use of before their stay in hospital. 


Since we want to investigate the comorbidity patterns of patients, the most important block for our purposes is the second, composed of twenty dummy variables for each record. Each dummy indicates whether or not a disease affects the patient. Figure 1 shows the list of diseases we are going to consider (the keyword is on the left, the corresponding disease or disorder on the right).

Apart from the information about the comorbidities summarized by the dummy variables, we kept only a few other variables: the identification number of each patient (ID), the number of the current hospitalization of each patient (adm number), the sex of the patient, their age, the date on which the patient leaves the study (dateOUT), the date of discharge (dateDISCHARGE), the number of days between admission to the hospital and leaving the study (timeADMtoOUT), and the dummy variable which indicates whether or not a patient died (DEATH), without distinguishing whether the death occurred in hospital or not. We decided to omit all the others in this work, since they were not useful for our purpose, that is, investigating whether and how the patterns of comorbidities evolve during the years of the study, using suitable networks to represent the data.

Beyond the choices about the variables that can be useful for us, we also need a 
proper pre-processing and reshaping of our data in order to apply network analysis 
to them. 


Fig. 1 List of diseases with corresponding acronyms (entries include pulmonary circulation disorders, HIV infection/AIDS and hypertension)


1.2 Data pre-processing


In order to properly represent the data within a network, we extracted only the last hospitalization of each patient. This choice is due to the way the morbidity load is computed in the dataset: once a morbidity appears, it remains “active” in the following hospitalizations of the same patient. Moreover, since the last hospitalization presents the worst case for each patient, it is a good indicator of the diffusion of comorbidities. We also extracted age and sex as covariates for each patient, and then computed the average age and the percentage of men affected by each morbidity. These features are considered as node attributes in our networks. In addition, we chose to consider death as a morbidity, without distinguishing whether it took place in hospital or not. This choice is aimed at discovering which morbidities are directly connected to death, since they are more dangerous than the others.

Given the network, our aim is to investigate whether, and potentially how, the patterns of comorbidities evolve over time. We therefore built one network per year. We decided to make each patient contribute to only one network, namely the one related to the year of the patient’s last hospitalization. This approach guarantees that a patient recorded in the network of one year cannot also be found in the network of a different year.


2 Building the Network 


By separating the years of hospitalization and making patients contribute only to the network containing their last hospitalization, we get seven networks, one for each year between 2006 and 2012. In other words, we took a snapshot of the comorbidity patterns of the patients year by year, without inducing dependencies between the different years. We also considered a patient as contributing to the death edges only if his/her death occurred during the year of his/her last hospitalization.

Building the networks with this approach, we still obtained very dense graphs. In order to reduce the number of connections and retain only the edges whose weights are significant, we decided to consider only the positive correlations and to impose an empirically chosen threshold on them.

Figure 2 shows an example of the network for the year 2007. This network consists of twenty nodes, that is, nineteen different morbidities (HF is excluded) plus death, while the links are weighted by φ-correlations (see [4] for details). In particular, we considered only links with a φ-correlation [5] greater than 0.02.
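
The construction of the weighted edges can be sketched as follows. The φ-correlation of two dummy variables is computed from their 2 x 2 contingency table, and an edge is kept only when the weight exceeds the threshold. The patient data, the keywords and the function names below are hypothetical, and the sketch is not the code used for the analysis.

```python
import math
from itertools import combinations


def phi(x, y):
    """Phi coefficient between two 0/1 lists of equal length."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A constant column has no defined correlation; treat it as no link.
    return (n11 * n00 - n10 * n01) / denom if denom > 0 else 0.0


def build_edges(dummies, threshold=0.02):
    """Map each pair of diseases to its phi weight, if above the threshold.

    dummies maps a disease keyword to its 0/1 column over the patients
    (one entry per patient, taken from the last hospitalization).
    """
    edges = {}
    for u, v in combinations(sorted(dummies), 2):
        w = phi(dummies[u], dummies[v])
        if w > threshold:  # keeps only sufficiently positive correlations
            edges[(u, v)] = w
    return edges


# Hypothetical dummies for 8 patients.
dummies = {
    "renal": [1, 1, 1, 0, 0, 0, 0, 0],
    "hypertension": [1, 1, 0, 1, 0, 0, 0, 0],
    "DEATH": [0, 0, 0, 0, 1, 1, 0, 0],
}
edges = build_edges(dummies)
print(edges)
```

Because negative and near-zero correlations are discarded, the resulting graph is sparser than the one linking every pair of diseases that co-occur in at least one patient.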

Notice that we chose a small threshold in order to keep as many connections as possible among those with a positive associated weight. Nodes have different shapes: they are circular if the disease is mainly observed in women rather than men, and square otherwise. The edge widths are proportional to the corresponding weights. Finally, each vertex has a color representing the prevalence of the disease, that is, the proportion of the population presenting that morbidity.


3 Methods and Results 


A preliminary descriptive analysis of the networks described in Section 2 was carried out, following the literature on network analysis (see, among others, [2], [3] and [4]). We observed that most of the diseases have a small prevalence, apart from arrhythmia, hypertension, renal and pulmonarydz. We also noticed that the death vertex is a highly connected and important node. Recall that the prevalence of death coincides with the percentage of patients whose last hospitalization occurred in that year and who also died in the same year.

The observed high values of the global transitivity indicators highlight that the pathologies tend to form complex agglomerations of morbidities, i.e., HF rarely appears alone. Indeed, HF is often present with a comorbidity load composed of three other diseases. Moreover, in our networks, we noticed that the least connected pathologies are also the ones that tend to take part in more complex clinical conditions.

Locally, the pathologies can be easily ranked thanks to centrality measures: in particular, we ranked the diseases according to their spread in HF patients (prevalence), their propensity to form connections and/or strong connections (degree and strength) and their closeness to all the others. In practice, as an innovative result, we proposed a mixed index for the ranking of the pathologies. This approach led us to focus on the important diseases in all the years of the study. Thanks to this, we identified diseases peculiar to HF patients, such as pulmonary disease, hypertension and renal insufficiency. Similarly, the φ-correlations can be used to rank the pairwise comorbidities, which are represented by the links in the networks. We observed that many of the heaviest edges are connected to the node DEATH, but other pathologies are involved as well. We showed the trends of the most important pairwise comorbidities of HF patients.

Fig. 2 Network of comorbidities for 2007 with threshold = 0.02; the legend reports node shapes and prevalences.
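
Two of the centrality measures mentioned above, degree and strength, can be computed directly from a weighted edge list, as in the following sketch. The edge weights are hypothetical, and the mixed index proposed in the paper is not reproduced here, since its definition is not given in this text.

```python
from collections import defaultdict


def degree_and_strength(edges):
    """Degree (number of links) and strength (sum of weights) per node."""
    degree = defaultdict(int)
    strength = defaultdict(float)
    for (u, v), w in edges.items():
        for node in (u, v):
            degree[node] += 1
            strength[node] += w
    return dict(degree), dict(strength)


# Hypothetical weighted edges (weights play the role of phi-correlations).
edges = {
    ("DEATH", "renal"): 0.10,
    ("hypertension", "renal"): 0.05,
    ("DEATH", "arrhythmia"): 0.03,
}
degree, strength = degree_and_strength(edges)
# Rank the nodes by strength, from the most to the least strongly connected.
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

Ranking the edges themselves by weight, for instance with `sorted(edges, key=edges.get, reverse=True)`, gives the corresponding ordering of the pairwise comorbidities.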

Finally, community detection algorithms [6] allowed the agglomerations of morbidities to be pointed out: we identified some groups of diseases that are more connected to each other than to the diseases of other groups, for which suitable interpretations can be given.


4 Conclusions 


In this work we presented a novel approach to the analysis of comorbidity patterns in patients affected by HF using networks. It represents a new method that can be adopted for different kinds of analyses and pathologies.

The conclusions reached in this work are only a starting point for deeper analyses that can be done in the future. However, this work showed that networks are powerful and promising tools for investigating this kind of problem.


Automatic variable and components weighting
systems for Fuzzy cmeans of distributional data 




Antonio Irpino, Francisco de A.T. De Carvalho, Rosanna Verde 


Abstract A distributional variable describes an object by a 1-D probability or frequency density function. While in standard clustering algorithms all the variables contribute to the definition of the clusters with the same importance, subspace clustering aims at finding a subspace, as a linear combination of the original variables, where the clusters are well represented. This is done by weighting variables automatically, according to their capacity to discriminate between the clusters. Considering a decomposition of the squared $L_2$ Wasserstein distance for distributional data, and using the notion of adaptive distance, we extend a fuzzy subspace clustering algorithm so that it automatically computes relevance weights associated with variables as well as with their components. This is done for the whole dataset or cluster-wise. An application shows the advantages of using such algorithms.

Abstract A distributional variable describes an object through a probability or frequency
density function. While in standard clustering algorithms all the variables contribute
equally to the definition of the groups, subspace clustering techniques seek a subspace,
expressed as a linear combination of the original variables, in which the groups are well
represented. This is achieved by automatically identifying a system of weights for the
variables based on their discriminating capacity. Using a particular decomposition of the
L2 Wasserstein distance for distributional data, together with the notion of adaptive
distance, we propose extensions of a fuzzy subspace clustering algorithm that automatically
compute the weights associated with the variables or with their components. The weights
may refer either to the whole dataset or to a single group. An application shows the
advantages of the proposed algorithms.


Antonio Irpino
Dip. di Matematica e Fisica, Università degli Studi della Campania “L. Vanvitelli”, Viale Lincoln
5, 81100 Caserta e-mail: antonio.irpino@unicampania.it


Francisco de A.T. De Carvalho
Centro de Informatica, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes s/n
- Cidade Universitaria, CEP 50740-560, Recife-PE, Brazil e-mail: fatc@cin.ufpe.br


Rosanna Verde
Dip. di Matematica e Fisica, Università degli Studi della Campania “L. Vanvitelli”, Viale Lincoln
5, 81100 Caserta e-mail: rosanna.verde@unicampania.it


Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations.
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press


Key words: Distributional data, Fuzzy cmeans, subspace clustering, automatic 
weights, Wasserstein distance. 


1 Introduction 


Distributional (or distribution-valued) data are observed when an object is described
by a distributional variable, whose realizations are 1-D frequency (or probability)
density functions. Such data arise in many practical situations. For example, official
statistical institutes, in order to preserve the privacy of respondents, cannot release
the microdata collected from a territorial unit, but only a summary of such data. A
similar case occurs with repeated measurements observed on individuals and collected,
for example, by a bank or a hospital. In this case too, only a summarized version of the
data may be available. Empirical parametric or non-parametric density estimates are
useful tools for this purpose. Clustering aims to organize a set of objects into groups
such that the objects within a given cluster are more similar to each other than to those
of other clusters. Partitioning and hierarchical clustering are two possible approaches.
Methods also differ in how strongly an object belongs to a cluster: in hard clustering
each object is assigned to exactly one cluster, while in fuzzy [?] clustering an object
may belong, according to a membership degree, to more than one cluster at the same time.

Most clustering algorithms proposed for histogram data are hard partitioning
methods [14, 18, 11, 15, 16, 17].

However, the particular structure of the observed distributional data may yield clusters
that are not well separated and have a high internal variability, because some data are
forced to belong to only one cluster. In the presence of this kind of problem, a fuzzy
clustering algorithm is a suitable choice. This paper extends Refs. [11, 15, 16, 17] by
proposing a fuzzy c-means clustering algorithm.

Another main issue in cluster analysis is the different contribution of the variables
to the clustering process. Generally, clustering methods do not take into account the
different relevance of the variables. However, in most applications some variables may
discriminate between the clusters better than others; in other applications each cluster
may have its own set of variables that are most relevant for grouping the data. The
approach of subspace clustering [1] aims at finding a subspace of the original descriptor
space using a linear combination of the original variables. Generally, when data are
described by a large number of variables, subspace clustering also acts as a feature
selection method. The subspace (if all the data are considered) or the subspaces (if a
subspace is generated for each cluster) produced by such algorithms are optimal with
respect to a criterion that maximizes the homogeneity of the clusters and/or the
separation among
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them. A similar result was already reached in [6], where the use of adaptive dis- 
tances was proposed for clustering standard data. 

In the framework of Symbolic Data Analysis, [4, 5, 3] proposed several adaptive
distances, based on the Hausdorff, City-Block and Euclidean distances, in dynamic
clustering algorithms for set-valued data. Recently, a hard partitioning clustering
algorithm using an adaptive distance based on the L2 Wasserstein metric has been
proposed [13]. The authors propose two novel clustering schemes based on adaptive
distances, which automatically compute the relevance of each histogram variable
during the partitioning of the data set. Starting from a decomposition of the
L2 Wasserstein distance [12] and considering the variability measure introduced in
[17], the distance between two distributional data can be split into two components:
one related to the variability of the averages of the distributions and another related
to the different variability of the compared distributions. All the algorithms based
on the adaptive-distance approach of [6] are k-means-like algorithms in which the
minimization of a homogeneity criterion is subject to a constraint on the product of
the relevance weights. The subspace clustering approaches reviewed in [1], on the other
hand, impose a constraint on the sum of the relevance weights.
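
The two-component decomposition of the squared L2 Wasserstein distance can be illustrated
numerically. The following is a minimal sketch, not the authors' implementation: it
assumes that each distribution is available as a raw sample, approximates the quantile
functions on an evenly spaced grid, and uses the hypothetical function name
`wasserstein2_decomposition`. On the grid, the squared distance splits exactly into a
term for the difference between the means and a term for the difference between the
centered quantile functions, which reflects the different variability of the two
distributions.

```python
import numpy as np

def wasserstein2_decomposition(sample_f, sample_g, n_grid=1000):
    """Approximate squared L2 Wasserstein distance between two 1-D samples
    and split it into a mean component and a centered (variability) component."""
    # Midpoints of n_grid equal-probability intervals on (0, 1)
    t = (np.arange(n_grid) + 0.5) / n_grid
    q_f = np.quantile(sample_f, t)   # empirical quantile function of f
    q_g = np.quantile(sample_g, t)   # empirical quantile function of g

    total = np.mean((q_f - q_g) ** 2)
    # Component due to the difference between the means of the distributions
    mean_part = (q_f.mean() - q_g.mean()) ** 2
    # Component due to the difference between the centered quantile functions
    shape_part = np.mean(((q_f - q_f.mean()) - (q_g - q_g.mean())) ** 2)
    return total, mean_part, shape_part

rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=500)
sample_b = rng.normal(2.0, 3.0, size=500)
total, mean_part, shape_part = wasserstein2_decomposition(sample_a, sample_b)
# total equals mean_part + shape_part up to floating-point error
```

An adaptive distance can then weight the two components differently, which is what
makes component-wise relevance weights meaningful.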

In this paper, we extend a subspace fuzzy c-means algorithm to distributional data
using adaptive L2 Wasserstein distances. Taking into consideration the decomposition
of the L2 Wasserstein distance into two additive components [17], we propose adaptive
distances that take into account the two components of the variability of a set of
distributions. We propose to associate two sets of weights, one with each variable and
one with each component, such that the sum of the weights for the whole dataset or for
each cluster is equal to one. The proposed fuzzy clustering algorithm, based on adaptive
distances, alternates three steps that estimate the memberships of the objects in the
clusters, the weights for each variable and/or each component, and the cluster prototypes.
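
As an illustration of the membership step only, the sketch below applies the standard
fuzzy c-means membership update to a matrix of squared distances between objects and
cluster prototypes. It is a sketch under stated assumptions: the adaptive squared
distances are taken as already computed, the function name `fuzzy_memberships` and the
fuzzifier `m` are illustrative, and the weight and prototype updates of the proposed
algorithm are not given in this section and are therefore not sketched.

```python
import numpy as np

def fuzzy_memberships(sq_dist, m=2.0, eps=1e-12):
    """Standard fuzzy c-means membership update.

    sq_dist[i, k] is the squared distance between object i and prototype k
    (assumed precomputed); m > 1 is the fuzzifier. Each row of the result sums to one.
    """
    d = np.maximum(sq_dist, eps)           # guard against zero distances
    inv = d ** (-1.0 / (m - 1.0))          # closer prototypes get larger values
    return inv / inv.sum(axis=1, keepdims=True)

# Object 0 is closer to the first prototype; object 1 is equidistant from both
memberships = fuzzy_memberships(np.array([[1.0, 4.0], [2.0, 2.0]]), m=2.0)
```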
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A Bayesian oblique factor model with extension 
to tensor data 


Un modello fattoriale obliquo bayesiano con estensione a 
dati tensoriali 


Michael Jauch, Paolo Giordani, and David Dunson 


Abstract In this short paper, we discuss a novel way of constructing prior distri- 
butions for correlation matrices and an associated approach to inference. We con- 
struct a prior penalizing large correlations, which we incorporate into an oblique 
factor model and a Candecomp/Parafac model for three-way data. We argue that 
this choice of prior for the factor correlation matrix, combined with a shrinkage 
prior for elements of the factor loadings matrix, leads to interpretable solutions. At 
the meeting we will demonstrate this through applications to real data. 

Abstract In this short paper we discuss a new way of constructing prior distributions
for correlation matrices, together with the related inferential aspects. The prior,
constructed so as to penalize large correlations, is incorporated into an oblique
factor analysis model and into the Candecomp/Parafac model for three-way data. We
believe that this choice of prior for the factor correlation matrix, combined with a
shrinkage prior for the elements of the factor loadings matrix, yields interpretable
solutions. At the conference we will demonstrate this claim through applications to
real data.


Key words: oblique factor model, prior for correlation matrices, tensor decompo- 
sition, three-mode factor analysis 
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1 Introduction 


Factor analysis aims to explain the covariance structure between observed variables
as arising from a smaller number of unobserved latent factors. A Gaussian factor
model with factor dimension $S$ has the form

$$ y_i \mid B, f_i, \Sigma \sim N(B f_i, \Sigma), \qquad f_i \sim N(0, \Omega), \tag{1} $$

where $y_i$ is the centered vector of observed variables corresponding to the $i$th
observation, $B$ is the factor loadings matrix, $f_i$ is the $S$-dimensional vector of
latent factors for observation $i$, $\Omega$ is the covariance matrix of the latent
factors, and $\Sigma$ is a diagonal positive definite matrix. Marginalizing out the
latent factors yields

$$ y_i \mid B, \Omega, \Sigma \sim N(0, B \Omega B^T + \Sigma). \tag{2} $$

As is well-known, the factor model is not identifiable without further restrictions
on $B$, $\Omega$, $\Sigma$. Identifiability assumptions are important in Bayesian
computation as a means to ensure that estimation based on posterior samples is
meaningful. See [14] for a discussion of identifiability of the oblique factor model.
We impose the usual restriction that $\Omega$ be a correlation matrix.
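
The marginal covariance in (2) can be checked by simulation. The following is a minimal
sketch with arbitrary illustrative values for the loadings, the factor correlation
matrix and the error variances (none of them taken from the paper); the function name
`simulate_factor_model` is hypothetical.

```python
import numpy as np

def simulate_factor_model(B, Omega, sigma2, n_obs, rng):
    """Draw y_i = B f_i + e_i with f_i ~ N(0, Omega) and e_i ~ N(0, diag(sigma2))."""
    n_vars, n_factors = B.shape
    chol = np.linalg.cholesky(Omega)                      # Omega = chol @ chol.T
    F = rng.standard_normal((n_obs, n_factors)) @ chol.T  # rows are the f_i
    E = rng.standard_normal((n_obs, n_vars)) * np.sqrt(sigma2)
    return F @ B.T + E

rng = np.random.default_rng(1)
B = np.array([[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.2, 0.7]])  # illustrative loadings
Omega = np.array([[1.0, 0.3], [0.3, 1.0]])                      # factor correlations
sigma2 = np.array([0.5, 0.4, 0.6, 0.3])                         # error variances

Y = simulate_factor_model(B, Omega, sigma2, n_obs=200_000, rng=rng)
implied_cov = B @ Omega @ B.T + np.diag(sigma2)   # covariance implied by (2)
sample_cov = np.cov(Y, rowvar=False)
max_abs_error = np.max(np.abs(sample_cov - implied_cov))  # small for large n_obs
```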

If $\Omega$ is diagonal then the latent factors are uncorrelated (and thus independent)
and we obtain the conventional orthogonal factor analysis model. If $\Omega$ is not
diagonal, the latent factors are correlated and we obtain the so-called oblique factor
model. Many authors have argued that the restriction to uncorrelated factors is too
strict. For example, discussing the application of factor analysis in psychology,
Thurstone [19] remarks


It seems just as unnecessary to require that mental traits shall be uncorrelated in the general 
population as to require that height and weight be uncorrelated in the general population. 


A large body of methodology has been developed for the oblique factor model. For 
a detailed discussion, see Chapter 12 of Harman [11]. 

In traditional applications of factor analysis, interest lies in interpreting the latent 
factors as distinct and scientifically meaningful quantities. Interpretation proceeds 
by examination of the factor loadings matrix, which relates the latent factors to 
the observed variables. Interpretation is made easier if the factor loadings matrix 
possesses a simple structure. An example of a simple structure is near sparsity, in 
which the factor loadings matrix has a small number of large entries and a large 
number of small entries. Typically, allowing correlated factors allows for a simpler 
structure in the factor loadings matrix. However, correlated factors present their own 
difficulties to interpretation. If two factors are highly correlated, then it becomes 
impossible to interpret them as distinct quantities. When allowing for correlated 
factors, we should recognize the tradeoff between factor correlation and complexity 
of the loadings matrix. We tolerate some of the former if it buys us less of the latter, 
and vice versa. This idea is captured in a quote from [11]: 


It is clear that a certain simplicity of interpretation is sacrificed upon relinquishing the stan- 
dard of orthogonality. This disadvantage may be offset, however, if the linear descriptions 
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of the variables in terms of correlated factors can be made simpler than in the case of un- 
correlated ones. Generally this is possible. 


We can address this tradeoff in a Bayesian oblique factor analysis setting through
the choice of prior distributions for $B$ and $\Omega$. For those elements of the factor
loadings matrix $B$ which are not restricted to be zero for the sake of identifiability,
we can take advantage of global-local shrinkage priors, which will result in a nearly
sparse estimate for $B$ [15]. For the factor correlation matrix $\Omega$, we need a prior
on the set of correlation matrices which penalizes factor correlations. Defining such a
prior distribution that lends itself to relatively simple and scalable inference is
challenging. For a recent approach in the context of Bayesian factor analysis with
correlated factors, see [9], which provides extensive references to earlier relevant works.

A main contribution of this short paper is what we believe to be a novel approach
to constructing priors for correlation matrices. The construction is based on the
observation that, for any $N \times P$ matrix $X$ with unit norm columns, the product
$X^T X$ is a correlation matrix. Assigning each column of $X$ a probability distribution
having support on the unit sphere then induces a probability distribution on correlation
matrices. In the special case that the columns of $X$ are independent and uniformly
distributed on the unit sphere, we obtain closed-form densities for the correlations
which match priors discussed previously by [1, 9] and others. The proposed prior
does indeed penalize correlation, and the penalty increases as $N$ increases. In this
short paper, we only discuss the special case of independent columns uniformly
distributed on the unit sphere, but future work may consider other choices, leading to
more flexible distributions for correlation matrices. Inference for parameters lying
on the unit sphere can be performed using geodesic Monte Carlo [6], a scalable
Markov chain Monte Carlo (MCMC) method which can accommodate parameters
lying on manifolds embedded in Euclidean space.

We define a Bayesian oblique factor model using the aforementioned prior for 
the factor correlation matrix and a global-local shrinkage prior for elements of the 
factor loadings matrix. We discuss extension of the factor model to tensor valued 
data with an emphasis on the three-way case. For the conference presentation, we 
will show applications to real data. 


2 A Bayesian model for oblique factor analysis 


Suppose we have $I$ observations of $J$ variables. We let $y_i$ be the vector of $J$
centered variables corresponding to observation $i$. As before, we suppose that

$$ y_i = B f_i + e_i, \qquad f_i \sim N(0, \Omega), \qquad i = 1, \dots, I, \tag{3} $$

where $B$ is the $J \times S$ factor loadings matrix, $f_i$ is the $S \times 1$ vector of
latent factors for observation $i$, $\Omega$ is an $S \times S$ correlation matrix, and
$e_i$ is a $J \times 1$ vector of errors. The errors are independent and identically
distributed $N(0, \Sigma)$, where $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_J^2)$
is a diagonal positive definite matrix. In matrix form, we have that

$$ Y = F B^T + E, \tag{4} $$

where $Y = (y_1, \dots, y_I)^T$ is the $I \times J$ data matrix, $F = (f_1, \dots, f_I)^T$
is the $I \times S$ matrix of latent factors, and $E = (e_1, \dots, e_I)^T$ is the
$I \times J$ matrix of errors. To complete the Bayesian model specification, we need to
define priors for our parameters.


2.1 The prior for $\Omega$


As described in the introduction, let the matrix $X$ have columns
$x_1, \dots, x_P \in \mathcal{S}^{N-1}$, the $(N-1)$-dimensional unit sphere. Suppose
that the columns of $X$ are independent and uniformly distributed on $\mathcal{S}^{N-1}$.
We then set $\Omega = X^T X$.

Due to the simple construction for $\Omega$, it is possible to derive closed-form
expressions describing its distribution [8]. For instance, let $\omega$ be an arbitrary
off-diagonal element of $\Omega$. Then the density of $\omega$ is

$$ p(\omega) = \frac{1}{\sqrt{\pi}} \frac{\Gamma(N/2)}{\Gamma((N-1)/2)} (1 - \omega^2)^{(N-3)/2}, \qquad \omega \in [-1, 1], \tag{5} $$

the even-order moments are

$$ E(\omega^{2m}) = \prod_{j=1}^{m} \frac{2j - 1}{N + 2j - 2}, \qquad m = 1, 2, 3, \dots, \tag{6} $$

and the odd-order moments are zero. As Fig. 1 makes evident,

$$ \frac{\omega + 1}{2} \sim \mathrm{Beta}\left(\frac{N-1}{2}, \frac{N-1}{2}\right) \tag{7} $$

and the prior places a penalty on correlations which increases with $N$.
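
The construction can be sketched in a few lines. This is an illustration under stated
assumptions rather than the authors' code: uniform points on the unit sphere are
obtained by normalizing standard Gaussian vectors (a standard device), and the function
names `sample_correlation_prior` and `even_moment` are hypothetical. The second function
evaluates the product formula for the even-order moments, so that a Monte Carlo estimate
of the second moment of an off-diagonal element can be compared with its closed-form
value 1/N.

```python
import numpy as np

def sample_correlation_prior(n_dim, n_factors, rng):
    """Draw a correlation matrix X^T X, where the columns of X are independent
    and uniformly distributed on the unit sphere in R^n_dim."""
    x = rng.standard_normal((n_dim, n_factors))
    x /= np.linalg.norm(x, axis=0, keepdims=True)   # unit-norm columns
    return x.T @ x

def even_moment(n_dim, m):
    """Closed-form E(omega^(2m)) as the product over j = 1..m of (2j-1)/(N+2j-2)."""
    j = np.arange(1, m + 1)
    return float(np.prod((2 * j - 1) / (n_dim + 2 * j - 2)))

rng = np.random.default_rng(0)
N = 5
omega_draws = np.array(
    [sample_correlation_prior(N, 2, rng)[0, 1] for _ in range(20_000)]
)
empirical_second_moment = np.mean(omega_draws ** 2)
theoretical_second_moment = even_moment(N, 1)       # equals 1 / N
```

Since the second moment is 1/N, increasing N concentrates the correlations around zero,
which is the sense in which the penalty grows with N.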

The above properties make it clear that we have presented an alternate way of 
constructing a prior distribution for correlation matrices having the same marginal 
distributions for the correlations as the prior for correlation matrices discussed in 
[9] and the relevant references given there. However, our prior construction naturally 
leads to a wide variety of flexible generalizations (by choosing different distributions 
on the unit sphere) and allows for a different MCMC approach to inference based 
on Geodesic Monte Carlo [6]. 


2.2 Completing the prior specification 


We would like to choose a prior for $B$ favoring a simple, nearly sparse structure. A
variety of global-local shrinkage priors [15, 4] have been proposed which satisfy this
requirement and have desirable posterior concentration properties. These global-local
shrinkage priors can typically be represented as scale mixtures of Gaussians,
simplifying computation.

As mentioned in the introduction, identifiability assumptions are important in
Bayesian computation because they ensure that estimation based on posterior samples
is meaningful. We have already constrained $\Omega$ to be a correlation matrix. The
article by Peeters [14] gives three additional conditions on $B$ which, under the usual
regularity assumptions, guarantee identifiability of the oblique factor model. Decisions
about how to satisfy those conditions should be made on a case-by-case basis.

We can assign conventional priors for variances to the diagonal elements of $\Sigma$,
e.g. inverse gamma or reference priors.


3 Extension to tensor data 


When $I$ observations of $J$ variables are collected at $K$ occasions we have a three-way
array or tensor, denoted by $Y$, of order $I \times J \times K$. Occasions may refer to
time or, in general, to different conditions. Three-way tensor data are characterized by
three modes, namely the observation, variable and occasion modes. We let $y_i$ be the
vector corresponding to observation $i$. In contrast to the standard two-way case, $y_i$
contains the scores of $J$ centered variables at $K$ occasions and thus has length $JK$.

Fig. 1 Density of $\omega$ for various values of $N$. The densities are shifted and scaled
Beta$((N - 1)/2, (N - 1)/2)$ densities.

In principle, the two-way factor model in (4) might still be applied to tensor data.
In fact, it would be sufficient to juxtapose the observation-by-variable matrices
collected at every occasion, obtaining a matrix with rows given by
$y_1^T, \dots, y_I^T$. Such a matrix, usually denoted by $Y_A$, is the so-called
observation-mode matricization (or unfolding) of the tensor $Y$. However, in practice,
the decomposition of $Y_A$ through the factor model in (4) is inappropriate because the
interactions among the modes cannot be modelled.

A more sensible strategy is the three-mode factor analysis model known as Candecomp [7]
or Parafac [12], which we will refer to as Candecomp/Parafac or, more briefly, CP. The
CP model can be expressed as

$$ Y_A = F (C \odot B)^T + E_A, \tag{8} $$

where $B$ and $C$ are the factor loadings matrices for the variables (of order
$J \times S$) and for the occasions (of order $K \times S$), respectively. They capture
the influences of the variables and occasions on the $S$ latent factors. The symbol
$\odot$ denotes the Khatri-Rao product, i.e. the Kronecker product between pairs of
columns ($C \odot B = [c_1 \otimes b_1 | \cdots | c_S \otimes b_S]$, with
$B = (b_1, \dots, b_S)$ and $C = (c_1, \dots, c_S)$), and $E_A = (e_1, \dots, e_I)^T$ is
the $I \times JK$ matricization of the tensor of errors $E$. In contrast with the two-way
case, under mild conditions (see, e.g., [13]) the solution of the CP model is identified
up to trivial scaling and simultaneous permutation of the columns of $F$, $B$ and $C$.
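
The relation between the CP model and the observation-mode unfolding can be verified
numerically. The sketch below is illustrative: the helper name `khatri_rao`, the
dimensions and the random factor matrices are assumptions, and the error term is omitted
so that the identity holds exactly.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: column s equals kron(C[:, s], B[:, s])."""
    K, S = C.shape
    J, S_b = B.shape
    if S != S_b:
        raise ValueError("C and B must have the same number of columns")
    return np.einsum("ks,js->kjs", C, B).reshape(K * J, S)

rng = np.random.default_rng(0)
I, J, K, S = 6, 4, 3, 2
F = rng.standard_normal((I, S))   # latent factors
B = rng.standard_normal((J, S))   # variable loadings
C = rng.standard_normal((K, S))   # occasion loadings

# Noise-free CP tensor: Y[i, j, k] = sum_s F[i, s] * B[j, s] * C[k, s]
Y = np.einsum("is,js,ks->ijk", F, B, C)

# Observation-mode unfolding: the K occasion slices (each I x J) placed side by side
Y_A = np.concatenate([Y[:, :, k] for k in range(K)], axis=1)

cp_unfolded = F @ khatri_rao(C, B).T   # right-hand side of the CP model without errors
```

Recent SciPy versions also provide `scipy.linalg.khatri_rao`; the explicit version above
only makes the column ordering visible.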

The CP model was originally proposed as an exploratory tool without probabilis- 
tic assumptions. The probabilistic version was developed in [5] and [2, 3]. Actually, 
such probabilistic counterparts were proposed for the so-called Tucker3 model [20], 
which we will refer to as the T3 model. The T3 model represents an alternative 
three-mode generalization of the two-way factor analysis model. It can be formu- 
lated as 


$$ Y_A = F G_A (C \otimes B)^T + E_A. \tag{9} $$


In the T3 model, each mode has its own factors and different numbers of latent
factors for each mode can be assumed. Hence, $F$, $B$ and $C$ are matrices of order
$I \times P$, $J \times Q$, and $K \times R$, respectively, with $P$, $Q$ and $R$
denoting the numbers of factors for each mode. The triple interactions among the factors
of the three modes are captured by the $P \times Q \times R$ core tensor $G$, the generic
element of which, $g_{pqr}$, expresses the strength of the interaction among factor $p$
for the observation mode, factor $q$ for the variable mode, and factor $r$ for the
occasion mode. Note that in (9) $G_A$ denotes the observation-mode matricization of $G$.

The CP and T3 models are closely related. If $P = Q = R = S$ and $G$ has a superidentity
structure ($g_{pqr} = 1$ when $p = q = r$ and 0 otherwise), then it is easy to see that
formulas (8) and (9) coincide. Therefore, the CP model can be seen as a constrained
version of the T3 model where the same number of latent factors is assumed for all the
modes and each factor of a certain mode interacts with exactly one factor of the other
modes. This produces some relevant distinctions between the two models. The T3 model is
more general than the CP model, but the solution is not identified. Equally well-fitting
solutions can be found by rotating the factor matrices and compensating for such
rotations in the core. On the other hand, the CP model has a more parsimonious structure
and the solution, as mentioned, is identified. For this reason, we focus our attention
on the CP model.
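
The coincidence of the structural parts of (8) and (9) under a superidentity core can be
checked directly. This is a small numerical sketch with arbitrary dimensions and random
factor matrices (all illustrative assumptions); the matricization of the core uses the
same column ordering as the Kronecker product of C and B.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, S = 5, 4, 3, 2
F = rng.standard_normal((I, S))
B = rng.standard_normal((J, S))
C = rng.standard_normal((K, S))

# Superidentity core: g_pqr = 1 when p = q = r and 0 otherwise
G = np.zeros((S, S, S))
idx = np.arange(S)
G[idx, idx, idx] = 1.0

# Observation-mode matricization of the core (P x QR), with column index r * Q + q
G_A = G.transpose(0, 2, 1).reshape(S, S * S)

# Structural part of the T3 model: F G_A (C kron B)^T
t3_part = F @ G_A @ np.kron(C, B).T

# Structural part of the CP model: F (C khatri-rao B)^T
C_kr_B = np.stack([np.kron(C[:, s], B[:, s]) for s in range(S)], axis=1)
cp_part = F @ C_kr_B.T
```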

If the latent factors of $F$ are correlated, the covariance structure of the data
induced by the CP model takes the form (see also [17])

$$ (C \odot B)\, \Omega\, (C \odot B)^T + \Sigma. \tag{10} $$

Hence, under the same assumptions adopted in the two-way case, the model for the
$i$th observation in the tensor case is

$$ y_i \mid B, C, \Omega, \Sigma \sim N(0, (C \odot B)\, \Omega\, (C \odot B)^T + \Sigma). \tag{11} $$


A Bayesian CP model can be developed as a natural generalization of the two-way model
presented previously. The prior of Section 2.1 can again be used for the factor
correlation matrix $\Omega$. For the elements of $B$ and $C$, we can again use
global-local shrinkage priors which favor a nearly sparse structure. The uniqueness
of the CP solution up to scaling and simultaneous permutation of the columns of B 
and C implies that we only need to worry about column switching in the posterior 
samples, since the prior distributions for B and C fix their respective scales. Two 
solutions are a relabeling scheme in the style of [18] or simply fixing particular 
elements of B or C to be zero so that the columns can no longer be permuted. 
The Bayesian CP model enjoys the same advantages as the two-way oblique factor 
model. In particular, we hope to demonstrate that allowing (but penalizing) latent 
factor correlation and applying a shrinkage prior on the factor loadings leads to a 
general yet interpretable CP model. 


4 Inference 


For inference, we use geodesic Monte Carlo [6]. Geodesic Monte Carlo extends 
Hamiltonian Monte Carlo [16] to certain special manifolds which can be embed- 
ded in Euclidean space, such as the simplex, the unit sphere, or the Stiefel man- 
ifold. Like Hamiltonian Monte Carlo, geodesic Monte Carlo can generate distant 
Metropolis-Hastings proposals with a high probability of acceptance, ideally lead- 
ing to a rapidly-mixing Markov chain with low autocorrelation. As described in [6], 
sometimes parallel tempering [10] is required to move between isolated modes of 
the posterior distribution. 
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Statistical analysis for partially observed 
multilayered networks 


Analisi statistica di reti multi-strato parzialmente 
osservate 


Johan Koskinen, Chiara Broccatelli, Peng Wang, and Garry Robins 


Abstract Multilayered networks have been proposed as a joint representation of
associations between multiple types of entities or nodes, such as people and
organizations, where two types of nodes give rise to three distinct types of ties. The
typical roster data collection method may be impractical or infeasible when the node sets
are hard to detect or define, or because of the cognitive demands on respondents.
Multilayered networks allow us to consider a multitude of different sources of data
and to sample on different types of nodes and relations. We consider modelling
multilayered networks using exponential random graph models and extend a recently
developed Bayesian data-augmentation scheme to allow for partially missing data.


Abstract Multilayered networks have been proposed as a joint representation of
associations between different types of entities or nodes, where two different types
of nodes generate three different combinations of ties. Traditional data collection
methods are not always usable, either because the node set may be difficult to define
or because of the cognitive demands placed on respondents. Multilayered networks offer
an advantage in these situations because they make it possible to use several sources
of information jointly and simplify the sampling of different nodes and relations. This
work models multilayered networks through exponential random graph models (ERGMs) and
extends a recent data-augmentation technique, based on the Bayesian paradigm, that makes
it possible to handle partially missing data.


Johan Koskinen
Social Statistics Discipline Area, University of Manchester, Manchester M13 9PL e-mail:
johan.koskinen@manchester.ac.uk


Chiara Broccatelli
Sociology, University of Manchester, Manchester M13 9PL e-mail:
chiara.broccatelli@postgrad.manchester.ac.uk


Peng Wang
Centre for Transformative Innovation, Faculty of Business and Law, Swinburne
University of Technology, Australia e-mail: pengwang@swin.edu.au


Garry Robins
Melbourne School of Psychological Sciences, The University of Melbourne, Australia e-mail:
garrylr@unimelb.edu.au


Key words: ERGM, Exchange algorithm, Missing data, Multilevel networks, So- 
cial Networks 


1 Introduction 


Kivelä et al. (2014) coined the term ‘multilayered networks’ as a general framework
for jointly designating multiple types of network data, such as one-mode, two-mode,
and multiplex networks, where researchers had typically dealt with each instance
separately. Here we primarily consider the extension of the exponential random graph
(ERGM) family of distributions proposed by Wang et al. (2013) to the subclass of
multilayered networks typically referred to as ‘multilevel networks’ (Lazega et al.,
2008), even though the key ideas in dealing with partially observed data generalise to
other extensions of ERGM, such as multiplex networks (Pattison and Wasserman, 1999). In
situations where you are likely to have imperfect information on network ties, availing
yourself of the full set of tools that may be derived from a wider framework for
networks may prove beneficial.


2 Data Structure 


We assume two distinct sets of nodes, $A = \{1, \dots, n\}$ and $B = \{1, \dots, m\}$,
where we might observe ties among all combinations of node types. A tie thus belongs to
one of the sets $\binom{A}{2}$, $A \times B$, or $\binom{B}{2}$. In the sequel we will
use AA, AB, and BB as notational shorthand for these edge-sets, with the corresponding
incidence matrices $X_{AA}$, $X_{AB}$, and $X_{BB}$, respectively. The element $X_{E,v}$
of matrix $X_E$ is equal to 1 if the edge $v \in E$ belongs to the graph and 0 otherwise.
The multilevel network may be represented as a one-mode network with a blocked, symmetric
adjacency matrix

$$ X = \begin{pmatrix} X_{AA} & X_{AB} \\ X_{BA} & X_{BB} \end{pmatrix}. $$
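
The blocked representation can be assembled as follows. This is a minimal sketch with a
toy network of three A-nodes and two B-nodes; the function name `blocked_adjacency` is
hypothetical, and the lower-left block is taken to be the transpose of the upper-right
block, which is what symmetry of the blocked matrix requires.

```python
import numpy as np

def blocked_adjacency(X_AA, X_AB, X_BB):
    """Stack the within-level and cross-level tie matrices into one symmetric
    (n + m) x (n + m) adjacency matrix."""
    if not (np.array_equal(X_AA, X_AA.T) and np.array_equal(X_BB, X_BB.T)):
        raise ValueError("within-level matrices must be symmetric")
    return np.block([[X_AA, X_AB], [X_AB.T, X_BB]])

X_AA = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])     # ties among the n = 3 nodes in A
X_BB = np.array([[0, 1],
                 [1, 0]])        # ties among the m = 2 nodes in B
X_AB = np.array([[1, 0],
                 [0, 0],
                 [0, 1]])        # affiliations of A-nodes with B-nodes

X = blocked_adjacency(X_AA, X_AB, X_BB)
```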


When extending binary one-mode networks to multiple relations (say ‘friendship’
and ‘advice’) it is convention to represent this as a collection of graphs or adjacency
matrices, one for each relation. For multilevel networks we by definition have different
relations for different combinations of node-sets. Let the number of relations be denoted
by $R_E$, for $E = AA, AB, BB$, with incidence matrices defined as
$X_E^{(r)} = (X_{E,v}^{(r)})$, where $X_{E,v}^{(r)} = 1$ if there is a tie on relation
$r = 0, \dots, R_E - 1$ for edge $v$ in edge-set $E = AA, AB, BB$. When the numbers of
relations for $E = AA, AB, BB$ differ, we are not able to unambiguously define the
multilayered network as a collection of one-mode networks with blocked, symmetric
adjacency matrices.

For AA, AB, and BB define the binary indicator matrices $D_{AA}$, $D_{AB}$, and $D_{BB}$,
where the element $D_{E,v}$ of $D_E$ is equal to 1 or 0 depending on whether the
corresponding tie-variable $v$ is observed or not, respectively. For each
$E = AA, AB, BB$ the indicators extend straightforwardly to account for more than one
relation. Thus, for example, if $X^{(0)}$ represents friendship ties and $X^{(1)}$
represents advice ties, the corresponding matrices $D^{(0)}$ and $D^{(1)}$ would indicate
which friendship and advice ties were observed and which ones were not.

We follow the convention (Little & Rubin, 1987) of partitioning data $X$ into observed
$X^{obs} = \{X_v : D_v = 1\}$ and unobserved $X^{mis} = \{X_v : D_v = 0\}$ data,
conditional on an outcome $D$. For a given $D$ we take $(X^{obs}, X^{mis})$ to denote
$X$ reconstructed.
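
The partition into observed and unobserved tie-variables can be made concrete with a toy
undirected one-mode network. The sketch below is illustrative only: the function name
`split_observed_missing` and the example matrices are assumptions, and the full matrix
is treated as known (as in a simulation), whereas in real data the values of the
unobserved tie-variables would not be available.

```python
import numpy as np

def split_observed_missing(X, D):
    """Partition the tie-variables of an undirected network into observed and
    unobserved sets, indexed by dyad (i, j) with i < j."""
    rows, cols = np.triu_indices(X.shape[0], k=1)   # each dyad counted once
    x_obs, x_mis = {}, {}
    for i, j in zip(rows.tolist(), cols.tolist()):
        target = x_obs if D[i, j] == 1 else x_mis
        target[(i, j)] = int(X[i, j])
    return x_obs, x_mis

X = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])   # full network
D = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])   # dyad (0, 2) is not observed

x_obs, x_mis = split_observed_missing(X, D)

# Putting the two parts back together reconstructs X
X_rebuilt = np.zeros_like(X)
for (i, j), value in {**x_obs, **x_mis}.items():
    X_rebuilt[i, j] = X_rebuilt[j, i] = value
```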


3 Model Formulation 


Frank and Strauss (1986) derived ERGMs for one-mode networks from the so-called
Markov dependence assumption, which posits that for any two pairs $\{i, j\}$ and
$\{k, \ell\}$ of vertices of a graph, the tie-variables are conditionally independent
given all other tie-variables, $X_{ij} \perp X_{k\ell} \mid X_{-(i,j),-(k,\ell)}$, if
$\{i, j\} \cap \{k, \ell\} = \emptyset$. They proved that the Markov dependence
assumption implies a log-linear model for the collection of tie-variables that has as
its sufficient statistics counts of different network ‘configurations’ (incidentally
echoing the conclusions drawn by Moreno and Jennings, 1938). Snijders et al. (2006)
elaborated on the Markov model by proposing parameters derived from the so-called social
circuit dependence assumption. The general form of ERGM is


$$ p(X \mid \theta) = \exp\{q(X; \theta) - \psi(\theta)\}, $$


where the normalising constant is
$\psi(\theta) = \log \sum_{Y \in \mathcal{X}} \exp\{q(Y; \theta)\}$ and $q(X; \theta)$ is
a potential that depends on the structure of the network and on a vector $\theta$ of
statistical parameters. This general form is agnostic to the specific dependencies we may
hypothesise for a particular type of network object. For undirected one-mode networks,
the model of Frank and Strauss (1986) has the potential written as a weighted sum of
sufficient graph statistics


$$ q(X; \theta) = \sum_{r} \theta_{S_r} \sum_{i} \binom{X_{i+}}{r} + \theta_T \sum_{(i,j,h) \in \binom{A}{3}} X_{ij} X_{ih} X_{jh}, $$


where the statistics correspond to two distinct categories of statistics, namely stars
and triangles (in the expression, $X_{i+} = \sum_j X_{ij}$). ERG models have been proposed
for two-mode networks (Skvoretz and Faust, 1999; Agneessens and Roose, 2008;
Wang et al., 2009) and multiplex networks (Pattison and Wasserman, 1999). The
modelling family has also been extended to the joint analysis of ties between different
types of nodes (Wasserman and Iacobucci, 1991) and to fully defined multilevel
networks by Wang et al. (2013). Wang et al. (2013) factor the function $q(X; \theta)$ as


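$$\begin{aligned}
q(X; \theta) ={}& \theta_{AA}^{\top} z(X_{AA}) + \theta_{BB}^{\top} z(X_{BB}) + \theta_{AB}^{\top} z(X_{AB}) \\
&+ \theta_{AA,BB}^{\top} z(X_{AA}, X_{BB}) + \theta_{AA,AB}^{\top} z(X_{AA}, X_{AB}) + \theta_{BB,AB}^{\top} z(X_{BB}, X_{AB}) \\
&+ \theta_{AA,BB,AB}^{\top} z(X_{AA}, X_{BB}, X_{AB})
\end{aligned}$$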


to explicitly allow for different dependencies depending on which edge-sets are
considered. For example, $z(X_{AA})$ only involves statistics calculated on AA, while
$z(X_{AA}, X_{BB})$ involves crossed statistics, calculated for ties in $\binom{A}{2} \times \binom{B}{2}$. With multiple
relations, statistics can be further partitioned, so that the linear predictors take
into account dependencies between different types of ties between different types
of nodes. Considering for example the interactions between ties in AA and AB, we
have


$$\theta_{AA,AB}^{\top} z(X_{AA}, X_{AB}) = \sum_{s=0}^{S} \sum_{t=0}^{T} \theta_{AA,AB}^{(s,t)\top} z(X_{AA}^{(s)}, X_{AB}^{(t)})$$


The interpretation is that a tie of type s among pairs in AA may depend on the affiliation
of nodes in A with nodes in B of type t. Conditional on a realisation X, we assume
an observation process


$$f(D \mid X, \zeta)\, \pi(\zeta)$$


where the parameter $\zeta$ is distinct (Little & Rubin, 1987) from $\theta$. The observation
process may be thought of equivalently as a missing data generating mechanism or
a sampling design, such as snowball sampling, for purposes of inference (Handcock
and Gile, 2010). If we assume that tie-variables are observed independently,
conditional on X, f(·) can be modelled as a regular log-linear model with a
standard link function. Given that D has the same range-space $\mathcal{X}$ as X, the observation
indicators can also be modelled using an ERGM. Inference for an informative,
MNAR process will, however, be contingent on informative priors.


4 Estimation 


We build on a recently proposed Bayesian data-augmentation scheme for doing
inference for one-mode ERGM under the assumption of MAR (Koskinen et al.,
2013). A Markov chain Monte Carlo (MCMC) scheme is constructed by drawing
from the joint posterior of $(\theta, X^{miss}, \zeta)$ using updating steps that move from
$(\theta^{(t-1)}, X^{miss,(t-1)}, \zeta^{(t-1)})$ to $(\theta^{(t)}, X^{miss,(t)}, \zeta^{(t)})$. Conditional on D, $\theta$ is updated
using the approximate exchange sampler (Caimo and Friel, 2011):


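(a) Draw $\eta$ from $h(\eta \mid \theta^{(t-1)})$

(b) Draw Y from $p(Y \mid \eta) = \exp\{q(Y; \eta) - \psi(\eta)\}$

(c) With probability $\min\{1, H\}$, set $\eta^{(t)} := \theta^{(t-1)}$ and $\theta^{(t)} := \eta$, where

$$\begin{aligned}
H &= \frac{p(X^{obs}, X^{miss,(t-1)} \mid \eta)\, \pi(\eta)\, h(\theta^{(t-1)} \mid \eta)\, p(Y \mid \theta^{(t-1)})}{p(X^{obs}, X^{miss,(t-1)} \mid \theta^{(t-1)})\, \pi(\theta^{(t-1)})\, h(\eta \mid \theta^{(t-1)})\, p(Y \mid \eta)} \\
&= \exp\{q(X^{obs}, X^{miss,(t-1)}; \eta) + q(Y; \theta^{(t-1)}) - q(X^{obs}, X^{miss,(t-1)}; \theta^{(t-1)}) - q(Y; \eta)\} \cdot \pi(\eta) / \pi(\theta^{(t-1)})
\end{aligned}$$

otherwise $\eta^{(t)} := \eta$ and $\theta^{(t)} := \theta^{(t-1)}$.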
In (a), h(-) is a symmetric proposal distribution, typically a multivariate Gaussian 
distribution. In the exchange sampler (Murray et al., 2006), updating steps (a) and 
(b) are performed by drawing directly from the conditional distributions in a Gibbs 
update. Generally for ERGM (b) will have to be performed through MCMC, mean- 
ing that the algorithm for drawing from the posterior is not a proper MCMC scheme. 
Koskinen (2008) uses an on-line monitoring algorithm for appraising burn-in with 
properties similar to the automatic convergence guaranteed by perfect sampling 
(Propp and Wilson, 1996). 
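
As a minimal sketch of steps (a) to (c), the following Python code runs the exchange sampler on a deliberately simple stand-in model that is not the one used in this paper: an edge-only ERGM, for which the potential is the parameter times the number of edges and step (b) can be carried out exactly because ties are independent. The function names, the N(0, 10^2) prior and the step size are assumptions made for illustration only.

```python
import math
import random


def log_prior(theta, sd=10.0):
    """Log density, up to a constant, of an assumed N(0, sd^2) prior."""
    return -0.5 * (theta / sd) ** 2


def draw_edge_count(theta, n_dyads, rng):
    """Exact draw of the edge count under an edge-only ERGM.

    With potential theta * edges(X), ties are independent Bernoulli
    variables with success probability 1 / (1 + exp(-theta)).
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    return sum(1 for _ in range(n_dyads) if rng.random() < p)


def exchange_sampler(obs_edges, n_dyads, n_iter=4000, step=0.3, seed=1):
    """Posterior draws of theta for the edge-only model."""
    rng = random.Random(seed)
    theta = 0.0
    draws = []
    for _ in range(n_iter):
        # (a) symmetric Gaussian random-walk proposal
        eta = theta + rng.gauss(0.0, step)
        # (b) auxiliary network drawn from the model at the proposed value
        aux_edges = draw_edge_count(eta, n_dyads, rng)
        # (c) log acceptance ratio; the normalising constants cancel
        log_h = ((eta - theta) * (obs_edges - aux_edges)
                 + log_prior(eta) - log_prior(theta))
        if rng.random() < math.exp(min(0.0, log_h)):
            theta = eta
        draws.append(theta)
    return draws
```

For a network with 38 observed edges among 190 dyads, the draws should concentrate near the log-odds of the observed density, log(0.2 / 0.8). For models with dependence between ties, step (b) would itself require MCMC, which is the approximation discussed above.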

Koskinen et al. (2013) propose to update $X^{miss}$ under the assumption of missing
at random (MAR) for one-mode networks. Whereas MAR implies

$$f(D \mid X^{obs}, X^{miss}, \zeta) = f(D \mid X^{obs}, \zeta),$$

we relax the assumption of MAR and allow for D to depend on all of X. The modification
of the updating step for missing data is to draw $X^{miss}$ given the rest from
the full conditional posterior


$$\pi(X^{miss} \mid X^{obs}, D, \theta, \zeta) = \frac{\exp\{q(X^{obs}, X^{miss}; \theta) - \psi(\theta)\}\, f(D \mid X^{obs}, X^{miss}, \zeta)\, \pi(\zeta)}{\sum_{Y^{miss}} \exp\{q(X^{obs}, Y^{miss}; \theta) - \psi(\theta)\}\, f(D \mid X^{obs}, Y^{miss}, \zeta)\, \pi(\zeta)}$$


The conditional distribution of $\zeta$ simplifies to a distribution proportional to
$f(D \mid X^{obs}, X^{miss}, \zeta)\, \pi(\zeta)$. If the distribution f(·) is not fully tractable, draws of $\zeta$
cannot be made directly. Assuming that it is straightforward to draw D from f(·), $\zeta$
can be updated using steps (a), (b) and (c), with f(·) playing the role of p(·).


5 Empirical illustration 


We provide a brief empirical case-study using the so-called ‘Noordin Top’ Terrorist 
Network (Everton, 2012) as our assumed true network. The node set A consists of 
n= 79 individuals and B of m = 129 recorded events. The friendship ties reported in 
Everton serve as the ties in AA and the participation in events and operations listed 
by Roberts and Everton (2011) are the ties of the affiliation set AB. To construct ties 
BB among events, we have elaborated on the time-stamped version of Broccatelli, 
Everett and Koskinen (2016) and coded up the explicitly mentioned connections 
between different events and operations in the International Crisis Group Report
(International Crisis Group, 2006). For the purposes of illustration, the event-by-event
network is considered fixed and exogenous. Furthermore, we condition on the
overall activity of the network, fixing the number of ties in both AA and AB.
Consequently, all analyses have to be interpreted conditionally on the overall number of
event participations and the total number of friendship ties.

The configurations z(·) are illustrated in Figure 1 and are described in more detail
in Wang et al. (2014). For the completely observed network, summaries of the posteriors
for the corresponding parameters are provided in Table 2. As is typical for one-mode
networks, we find strong support for triadic closure (the 95% CI for ATA is (0.341, 1))
but also strong support for people taking part in events that are functionally related
to other events that they take part in (the 95% CI for TriangleXBX is (0.786, 1.859)).


[Figure 1 panels: (a) People Centralisation (ASA); (b) People Clustering (ATA); (c) Activity Assortativity (Star2AX); (d) Event Assortativity (Star2BX); (e) Functionality Closure of Affiliations (TriangleXBX); (f) Affiliation Clustering (XACB)]


Fig. 1: Configurations of multilevel ERGM for Noordin Top (configurations a, b, 
and f are geometrically weighted) 


To provide an example of multilevel snowball sampling, we snowball using Operation 3
as our seed (this is the 2004 Australian embassy bombing that took place
on 9 September 2004 in Jakarta, Indonesia, killing 9-11 people and injuring more
than 150 people). Anyone who participated in this operation is defined as being in
wave 1, and anyone who is not in wave 1 but is tied to anyone in wave 1 belongs
to wave 2. The results in Table 2 are qualitatively the same as for the model with
completely observed data.
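
The wave definition above can be sketched in a few lines of Python. This is an illustration only: the dictionaries `participation` (person to set of events) and `neighbours` (person to set of tied persons) are hypothetical data structures, and which ties count for moving from one wave to the next (friendship, shared events, or both) is left to whoever builds `neighbours`.

```python
def snowball_waves(seed_event, participation, neighbours, n_waves=2):
    """Assign people to snowball waves starting from a seed event.

    participation: dict mapping each person to the set of events attended.
    neighbours:    dict mapping each person to the set of persons tied to them.
    Returns a dict person -> wave number; people not reached are omitted.
    """
    # Wave 1: everyone who took part in the seed event.
    wave_of = {p: 1 for p, events in participation.items() if seed_event in events}
    frontier = set(wave_of)
    for wave in range(2, n_waves + 1):
        # Next wave: people tied to the current wave and not yet sampled.
        reached = set()
        for person in frontier:
            for other in neighbours.get(person, ()):
                if other not in wave_of:
                    reached.add(other)
        for other in reached:
            wave_of[other] = wave
        frontier = reached
    return wave_of
```

Tie-variables among sampled people would then be treated as observed and all remaining ones as missing.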

To provide a brief example of a MNAR observation process, for each tie-variable
(i, j), we define independently $\Pr(D_{ij} = 1 \mid X, \zeta) = \Pr(D_{ij} = 1 \mid h_{ij}(X), \zeta)$,
where $h_{ij}(X) = \max\{d_i(X), d_j(X)\}$ and $d_i(X)$ is the distance in X between
$i \in A \cup B$ and Noordin Top (all ties in BB are assumed fixed and known). We model
the probabilities $\Pr(D_{ij} = 1 \mid h_{ij}(X), \zeta)$ as in Table 1, with the interpretation that ties
that are further from the leader Noordin Top are less visible than ties close to him.
The results in Table 2 indicate that effects corresponding to clustering are attenuated
but degree-related effects are amplified (with the exception of XASA). These
changes are a natural consequence of the observation process respecting distance
but not necessarily clustering.


Table 1: Detection bias in MNAR observation mechanism for Noordin Top 


h_ij(x)               1      2      3      4     >4
n_h(x)             1122   6360   6090   1190   1750
Pr(D_ij = 1 | x)   0.99   0.75   0.5    0.25   0.15
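
To make the detection mechanism in Table 1 concrete, the sketch below maps a pair of distances to the tabulated detection probability and simulates observation indicators. The function names are hypothetical, and reading the second row of Table 1 as the number of tie-variables at each distance is an assumption.

```python
import random

# Detection probabilities from Table 1, indexed by h_ij = max(d_i, d_j).
DETECTION_PROB = {1: 0.99, 2: 0.75, 3: 0.5, 4: 0.25}
PROB_BEYOND_4 = 0.15
# Second row of Table 1, read here as counts of tie-variables (an assumption).
TIE_COUNTS = {1: 1122, 2: 6360, 3: 6090, 4: 1190, ">4": 1750}


def detection_probability(d_i, d_j):
    """Probability that tie-variable (i, j) is observed, given both distances."""
    h = max(d_i, d_j)
    if h < 1:
        raise ValueError("h_ij must be at least 1 for a pair of distinct nodes")
    return DETECTION_PROB.get(h, PROB_BEYOND_4)


def simulate_indicators(distance, dyads, seed=0):
    """Draw D_ij independently for each dyad; distance maps node -> distance."""
    rng = random.Random(seed)
    return {
        (i, j): int(rng.random() < detection_probability(distance[i], distance[j]))
        for i, j in dyads
    }


def expected_observed():
    """Expected number of observed tie-variables implied by Table 1."""
    total = 0.0
    for h, count in TIE_COUNTS.items():
        p = PROB_BEYOND_4 if h == ">4" else DETECTION_PROB[h]
        total += count * p
    return total
```

Under this reading of the table, about 9486 of the 16512 tie-variables (roughly 57%) would be observed on average.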


Table 2: Posterior summaries for ERGM fitted to Noordin Top 


                 no missing        snowball sample    MNAR
Effect           Mean     Std      Mean     Std       Mean     Std
ASA              0.162    0.215    0.160    0.229     0.662    0.264
ATA              0.673    0.169    0.637    0.177     0.29     0.201
Star2AX          0.106    0.020    0.106    0.020     0.129    0.021
Star2BX         -0.014    0.046    0.000    0.049     0.022    0.06
TriangleXBX      1.322    0.273    1.299    0.278     1.191    0.293
XASA             0.185    0.205    0.337    0.213     0.037    0.212
XACB             0.106    0.029    0.091    0.035     0.069    0.046

6 Conclusions and future directions 


We have proposed a statistical approach for analysing the structure of multilayered
networks that accounts for imperfections in data. We provide an illustrative example
of the analysis of a multilevel network for three types of observation processes. While
the approach is consistent when the observation process is known, a MNAR process
requires making a number of untestable assumptions and is most likely of use merely
as a sensitivity analysis. Further work is needed in order to systematically investigate
the sensitivity of inference to different plausible MNAR mechanisms.


Copula-based segmentation of environmental
time series with linear and circular components


Segmentazione di serie storiche ambientali con 
componenti lineari e circolari basata su copule 


Francesco Lagona 


Abstract A novel segmentation method is proposed for the analysis of bivariate 
time series of intensities and angles that often occur in environmental applications. 
The model is based on a mixture of copula-based cylindrical distributions, whose 
parameters evolve according to a latent Markov chain. The model parsimoniously 
accommodates typical features of cylindrical time series such as circular-linear cor- 
relation, multimodality, skewness and temporal auto-correlation. A computationally 
efficient Expectation-Maximization algorithm is described to estimate the parame- 
ters and a parametric bootstrap routine is exploited to compute confidence intervals. 
These methods are illustrated on cylindrical time series of wave heights and direc- 
tions in the Adriatic sea. 

Abstract Si propone una nuova procedura di segmentazione per serie storiche bi- 
variate con componenti lineari e circolari , tipiche delle applicazioni di statistica 
ambientale. Il modello si basa su una mistura di distribuzioni cilindriche, defi- 
nite attraverso una copula, i cui parametri variano secondo l’evoluzione di una 
catena markoviana latente. Il modello integra in modo parsimonioso caratteristiche 
tipiche delle serie cilindriche, quali la correlazione tra osservazioni lineari e cir- 
colari, l’asimmetria, la multimodalità e l’auto-correlazione temporale dei dati. Si 
propone un algoritmo di tipo EM computazionalmente efficiente per la stima dei 
parametri ed una procedura di tipo bootstrap per la stima degli intervalli di confi- 
denza. L’applicazione a una serie storica di direzioni e altezze d’onda nell’Adriatico 
illustra la metodologia. 


Key words: clustering, copula, hidden Markov model, environmetrics, linear-circular 
data, segmentation, waves


1 Introduction


Time series of angles and intensities arise often in environmental research. Recent
studies have focused, for example, on time series of wind directions and pollutant
concentrations [2], wind directions and speeds [7], and wave directions and heights [6].
Bivariate sequences of angles and intensities are often referred to as cylindrical time
series, because the pair of an angle and an intensity can be represented as a point on
a cylinder.

The analysis of cylindrical time series has been overlooked due to the special
topology of the support on which the measurements are taken (the cylinder), and to
the difficulties in modeling the cross-correlations between angular and linear
measurements over time. Further complications arise from the skewness and the
multimodality of the marginal distribution of the data. Indeed, intensities are typically
negatively skewed and directional data are rarely symmetric; multimodality may
arise as well, because the data are often observed under heterogeneous conditions that
vary over time.

This paper introduces a dynamic mixture of copula-based cylindrical distribu- 
tions that parsimoniously accounts for the specific features of cylindrical time se- 
ries. More precisely, we first introduce a cylindrical density as a joint distribution of 
a von Mises and a Weibull distribution by means of a circular copula and then ap- 
proximate the data distribution with a mixture of these cylindrical densities, whose 
parameters depend on the states of a latent Markov chain. This approach flexibly 
extends previous proposals that are either based on mixtures of conditionally inde- 
pendent linear and circular densities [4] or based on mixtures of Abe-Ley cylindrical 
densities [6]. It provides a unified framework where distributions that are typically 
exploited in environmental applications are jointly integrated. It is additionally nu- 
merically tractable, by exploiting a suitable Expectation-Maximization (EM) algo- 
rithm for parameter estimation. 


2 A copula-based cylindrical distribution 


A cylindrical sample is a pair z = (x, y), where $x \in [0, 2\pi)$ is a point on the circle and
y is a point on the positive semi-line $[0, +\infty)$. Let $I_0$ be the modified Bessel function
of order 0. If the circular component x and the linear component y are respectively
drawn from a von Mises distribution with density


$$f(x; \mu, \kappa) = \frac{\exp(\kappa \cos(x - \mu))}{2\pi I_0(\kappa)}$$


and from a Weibull distribution with density 


f(y;0,B)= n (5) se (3). 
and if g(u) is a von Mises density with parameters $\mu_c$ and $\kappa_c$, say


$$g(u; \mu_c, \kappa_c) = \frac{\exp(\kappa_c \cos(u - \mu_c))}{2\pi I_0(\kappa_c)},$$


then the copula-based density


$$f(z; \theta) = 2\pi\, g\big(2\pi (F(x) + F(y))\big)\, f(x)\, f(y), \qquad (1)$$


where F(x) and F(y) denote the cumulative distribution functions of the two marginals,
is defined on a cylindrical support up to the parameter vector $\theta = (\mu, \kappa, \alpha, \beta, \mu_c, \kappa_c)$
and has $f(x; \mu, \kappa)$ and $f(y; \alpha, \beta)$ as marginal densities [3].
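
A minimal numerical sketch of density (1) follows, using only the Python standard library. Two points are assumptions rather than the paper's specification: the Weibull density is written in the common shape and scale parameterisation, and the von Mises distribution function is obtained by numerical integration from 0. All function names are hypothetical.

```python
import math


def bessel_i0(kappa):
    """Modified Bessel function of order 0, summed from its power series."""
    total, term, m = 1.0, 1.0, 0
    while term > 1e-17 * total:
        m += 1
        term *= (kappa / (2.0 * m)) ** 2
        total += term
    return total


def von_mises_pdf(x, mu, kappa):
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def von_mises_cdf(x, mu, kappa, steps=400):
    """CDF on [0, 2*pi], integrating the density from 0 with Simpson's rule."""
    h = x / steps  # steps must be even
    total = von_mises_pdf(0.0, mu, kappa) + von_mises_pdf(x, mu, kappa)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * von_mises_pdf(i * h, mu, kappa)
    return total * h / 3.0


def weibull_pdf(y, shape, scale):
    """Weibull density in the shape/scale form (an assumed parameterisation)."""
    return (shape / scale) * (y / scale) ** (shape - 1.0) * math.exp(-((y / scale) ** shape))


def weibull_cdf(y, shape, scale):
    return 1.0 - math.exp(-((y / scale) ** shape))


def cylindrical_pdf(x, y, mu, kappa, shape, scale, mu_c, kappa_c):
    """Copula-based cylindrical density: 2*pi * g(2*pi*(F(x) + F(y))) * f(x) * f(y)."""
    u = 2.0 * math.pi * (von_mises_cdf(x, mu, kappa) + weibull_cdf(y, shape, scale))
    return (2.0 * math.pi * von_mises_pdf(u, mu_c, kappa_c)
            * von_mises_pdf(x, mu, kappa) * weibull_pdf(y, shape, scale))
```

Because g is a density on the circle, integrating `cylindrical_pdf` over x for fixed y should return the Weibull marginal at y, which gives a simple numerical check of the construction.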


3 Dynamic mixtures of cylindrical densities 


Let $z_{0:T} = (z_t, t = 0, \dots, T)$, $z_t = (x_t, y_t)$, $x_t \in [0, 2\pi)$, $y_t \in [0, +\infty)$, be a cylindrical
time series. We assume that the distribution of the data is driven by the evolution of
an unobserved Markov chain with K states, which represents (time-varying) latent
classes and can be specified as a sequence $\xi_{0:T} = (\xi_t, t = 0, \dots, T)$ of multinomial
variables $\xi_t = (\xi_{t1}, \dots, \xi_{tK})$ with one trial and K classes, whose binary components
represent class membership at time t. The joint distribution $p(\xi_{0:T}; \pi)$ of the chain
is fully known up to a parameter $\pi$ that includes K initial probabilities $\pi_k = P(\xi_{0k} = 1)$,
$k = 1, \dots, K$, $\sum_k \pi_k = 1$, and $K^2$ transition probabilities $\pi_{hk} = P(\xi_{tk} = 1 \mid \xi_{t-1,h} = 1)$,
$h, k = 1, \dots, K$, $\sum_k \pi_{hk} = 1$. Formally, we assume that


$$p(\xi_{0:T}; \pi) = \prod_{k=1}^{K} \pi_k^{\xi_{0k}} \prod_{t=1}^{T} \prod_{h=1}^{K} \prod_{k=1}^{K} \pi_{hk}^{\xi_{t-1,h}\, \xi_{tk}}. \qquad (2)$$


The specification of the model is completed by assuming that the observations are
conditionally independent, given a realization of the Markov chain. As a result, the
conditional distribution of the observed process, given the latent process, takes the
form of a product density, say


$$f(z_{0:T} \mid \xi_{0:T}; \theta_1, \dots, \theta_K) = \prod_{t=0}^{T} \prod_{k=1}^{K} f(z_t; \theta_k)^{\xi_{tk}}, \qquad (3)$$


where $f(z; \theta_k)$, $k = 1, \dots, K$, are the K cylindrical densities defined by (1) and known
up to a vector of parameters $\theta_k$. The likelihood function of the model is therefore
obtained by integrating the joint density of the observed data and the unobserved
class memberships with respect to the segmentation $\xi_{0:T}$, namely


$$L(\pi, \theta; z_{0:T}) = \sum_{\xi_{0:T}} p(\xi_{0:T}; \pi)\, f(z_{0:T} \mid \xi_{0:T}; \theta_1, \dots, \theta_K). \qquad (4)$$


By computing the maximum likelihood estimate $\hat{\theta}$, the cylindrical time series can
then be segmented according to the posterior probabilities of class membership
$\hat{\pi}_{tk} = P(\xi_{tk} = 1 \mid z_{0:T}; \hat{\theta})$, based on $\hat{\theta}$.


4 Estimation 


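An EM algorithm is proposed to maximize the likelihood function (4). It is based
on the following complete-data log-likelihood function


$$\begin{aligned}
\log L_{comp}(\theta, \pi, \xi_{0:T}; z_{0:T}) ={}& \sum_{k=1}^{K} \xi_{0k} \log \pi_k + \sum_{t=1}^{T} \sum_{h=1}^{K} \sum_{k=1}^{K} \xi_{t-1,h}\, \xi_{tk} \log \pi_{hk} \\
&+ \sum_{t=0}^{T} \sum_{k=1}^{K} \xi_{tk} \log f(z_t; \theta_k). \qquad (5)
\end{aligned}$$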


The algorithm is iterated by alternating an expectation (E) and a maximization (M)
step. Given the estimates $\hat{\pi}_s$ and $\hat{\theta}_s$, obtained at the end of the s-th iteration, the
(s+1)-th iteration is initialized by the E-step, which evaluates the expected value
of (5) with respect to the conditional distribution of the missing values $\xi_{0:T}$, given the
observed data.

The E step reduces to the computation of the univariate posterior probabilities of
each latent state at time t, $\hat{\pi}_{tk} = P(\xi_{tk} = 1 \mid z_{0:T}, \hat{\pi}_s, \hat{\theta}_s)$, $k = 1, \dots, K$, $t = 0, \dots, T$, and
the computation of the bivariate posterior probabilities of each pair of states at two
adjacent times, say $\hat{\pi}_{t-1,t,hk} = P(\xi_{t-1,h} = 1, \xi_{tk} = 1 \mid z_{0:T}, \hat{\pi}_s, \hat{\theta}_s)$, $h, k = 1, \dots, K$,
$t = 1, \dots, T$. The task of computing these posterior probabilities from an estimate $(\hat{\pi}_s, \hat{\theta}_s)$
is generally referred to as the HMM-smoothing numerical issue and it is typically
solved by specifying the posterior probabilities in terms of suitably normalized functions,
which can be computed recursively, avoiding impractical summations over the
state space of the latent Markov chain and numerical under- and over-flows [1].
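
One common way to carry out this smoothing step is the scaled forward-backward recursion. The sketch below is a generic implementation of that technique, not code from this paper; the argument names are hypothetical, and `dens[t][k]` stands for the state-specific density of the t-th observation under class k.

```python
import math


def forward_backward(init, trans, dens):
    """Scaled forward-backward recursions for a K-state hidden Markov model.

    init[k]      initial probability of state k
    trans[h][k]  transition probability from state h to state k
    dens[t][k]   density of observation t under state k
    Returns (uni, biv, loglik): uni[t][k] is the posterior probability of
    state k at time t, biv[t-1][h][k] the posterior probability of the pair
    (h at t-1, k at t), and loglik the log-likelihood of the data.
    """
    T, K = len(dens), len(init)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [init[k] * dens[0][k] for k in range(K)]
        else:
            a = [dens[t][k] * sum(alpha[t - 1][h] * trans[h][k] for h in range(K))
                 for k in range(K)]
        c = sum(a)  # conditional density of observation t given the past
        alpha.append([v / c for v in a])
        scale.append(c)
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[h][k] * dens[t + 1][k] * beta[t + 1][k] for k in range(K))
                   / scale[t + 1] for h in range(K)]
    uni = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    biv = [[[alpha[t - 1][h] * trans[h][k] * dens[t][k] * beta[t][k] / scale[t]
             for k in range(K)] for h in range(K)] for t in range(1, T)]
    loglik = sum(math.log(c) for c in scale)
    return uni, biv, loglik
```

Rescaling at every time point keeps all quantities of order one, which is what avoids the under- and over-flows mentioned above; the cost is linear in the length of the series rather than exponential.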

The M-step of the algorithm updates the estimate $(\hat{\pi}_s, \hat{\theta}_s)$ with a new estimate
$(\hat{\pi}_{s+1}, \hat{\theta}_{s+1})$, by maximizing the expected value of (5), obtained from the previous E
step. This expected value is the sum of functions that depend on independent sets of
parameters and can therefore be maximized separately. Maximization with respect
to the transition probabilities $\pi_{hk}$, under the constraints $\sum_k \pi_{hk} = 1$, $h = 1, \dots, K$,
provides the closed-form updating formula


$$\hat{\pi}_{hk(s+1)} = \frac{\sum_{t=1}^{T} \hat{\pi}_{t-1,t,hk}(\hat{\pi}_s, \hat{\theta}_s)}{\sum_{t=1}^{T} \hat{\pi}_{t-1,h}(\hat{\pi}_s, \hat{\theta}_s)}, \qquad h, k = 1, \dots, K.$$


Maximization with respect to the parameters $\theta_k$ of the K copula-based cylindrical
components reduces to a traditional IFM (inference function for margins; [5]) routine,
implemented on a weighted augmented dataset of n × K observations, where
each observation is replicated K times and weighted by the K univariate posterior
probabilities $\hat{\pi}_{tk}$, computed during the previous E step.
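
The closed-form update of the transition probabilities can be written directly from the posterior probabilities produced by the E step. In the sketch below the function name and the list layout (`biv` indexed by t = 1, ..., T and `uni` by t = 0, ..., T) are assumptions made for illustration.

```python
def update_transitions(biv, uni):
    """M-step update of the transition matrix.

    biv[t-1][h][k]  posterior probability of state h at t-1 and state k at t,
                    for t = 1, ..., T (so len(biv) == T)
    uni[t][k]       posterior probability of state k at time t, t = 0, ..., T
    Returns the K x K matrix of updated transition probabilities.
    """
    K = len(uni[0])
    new_trans = []
    for h in range(K):
        # Denominator: expected number of visits to state h at times 0..T-1.
        denom = sum(uni[t][h] for t in range(len(biv)))
        # Numerator: expected number of transitions from h to k.
        new_trans.append([sum(b[h][k] for b in biv) / denom for k in range(K)])
    return new_trans
```

Each row of the result sums to one whenever the univariate probabilities are the row sums of the bivariate ones, as is the case for exact smoothing output.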

The procedure outlined above does not produce confidence intervals for the estimates,
which can however be computed by taking a parametric bootstrap approach.
In this paper, the model was re-fitted to R = 400 bootstrap samples, simulated
from the estimated model parameters, and the 2.5% and the 97.5% quantiles of the
empirical distribution of each bootstrap estimate were computed.
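
The bootstrap routine can be sketched generically as follows. The cylindrical hidden Markov model is replaced here by a deliberately simple stand-in (exponential data with a single rate parameter), so `simulate_exponential` and `fit_exponential` are hypothetical placeholders for the simulation and EM fitting routines of the actual model.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation quantile of an ascending list, for 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def bootstrap_interval(estimate, simulate, fit, n_boot=400, level=0.95, seed=0):
    """Percentile interval from a parametric bootstrap.

    simulate(estimate, rng) draws a data set from the fitted model and
    fit(sample) returns the estimate obtained by re-fitting the model to it.
    """
    rng = random.Random(seed)
    boot = sorted(fit(simulate(estimate, rng)) for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    return percentile(boot, tail), percentile(boot, 1.0 - tail)


# Stand-in model, assumed for illustration only: exponential observations.
def simulate_exponential(rate, rng, n=100):
    return [rng.expovariate(rate) for _ in range(n)]


def fit_exponential(sample):
    return len(sample) / sum(sample)
```

Calling `bootstrap_interval(2.0, simulate_exponential, fit_exponential)` returns the 2.5% and 97.5% quantiles of 400 re-estimated rates; for a multi-parameter model the same quantiles would be taken separately for each parameter.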


Table 1 Parameter estimates and bootstrap quantiles under three segmentation classes


Class     parameter   estimate   2.5% quantile   97.5% quantile
Class 1   α           2.61       2.16            2.75
          β           0.81       0.72            0.86
          κ           1.23       0.80            1.31
          μ           1.13       1.05            1.19
          κ_c         0.98       0.89            1.08
          μ_c         0.52       0.49            0.62
Class 2   α           2.41       2.26            2.58
          β           2.99       2.81            3.17
          κ           0.26       0.21            0.60
          μ           0.22       0.11            3.81
          κ_c         2.18       1.98            3.19
          μ_c         0.72       0.49            0.92
Class 3   α           2.18       2.05            3.26
          β           2.10       1.92            2.47
          κ           1.95       1.48            2.72
          μ           2.03       1.87            2.11
          κ_c         4.01       2.98            5.01
          μ_c         0.12       0.04            1.01

Transition probabilities (origin in rows, destination in columns)
origin    Class 1   Class 2   Class 3
Class 1   0.975     0.018     0.007
Class 2   0.000     0.990     0.010
Class 3   0.010     0.016     0.974


5 Application: regimes of wave in the Adriatic sea 


The proposed methods have been implemented to segment a time series of semi- 
hourly wave directions and heights, recorded in the period 2/15/2010 - 3/16/2010 
by the buoy of Ancona, located in the Adriatic Sea at about 30 km from the coast. 
Segmentation of these data according to meaningful environmental regimes is often
required in studies of the drift of floating objects and oil spills, in the design of
off-shore structures and in studies of sediment transport and coastal erosion. 

A number of models have been estimated from these data, by varying the num- 
ber K of components from 2 to 4, and the BIC statistic suggested to segment the 
data according to 3 regimes. Table 1 displays the estimates under these three latent 
classes, along with bootstrap percentiles, computed by simulating 400 samples. 

The first component of the model (class 1) is associated with high waves coming
from the north. These waves are generated by northern Bora jets that blow along the
major axis of the Adriatic basin. Under a Bora episode, most of the wind energy is 
transferred to the sea surface and, as a result, most of the data with the highest waves 
in the sample are clustered within this regime. The second component of the model 
(class 2) is associated with periods of calm sea. Under this regime, moderate waves 
are uniformly distributed around the circle of directions. The third component (class 
3) is associated with Sirocco episodes. In this regime, waves travel southeasterly 
along the major axis of the Adriatic basin, driven by winds that blow from a similar 
directional angle. 

The rows at the bottom of Table 1 include the estimated transition probabilities of
the latent Markov chain. The transition probability matrix is essentially diagonal,
reflecting the temporal persistence of the classes. In particular, the small off-diagonal
transition probabilities between classes 1 and 3 indicate that direct transitions between
Sirocco and Bora episodes are very unlikely. The segmentation model hence confirms
that the sea surface in the study area tends to alternate relevant marine events
with periods of good sea conditions.


Acknowledgements This work is developed under the PRIN2015 supported-project ‘’Environ-


A Multiscale Approach to Manifold Estimation
Un Approccio Multiscala alla Stima di Varietà 


Alessandro Lanteri and Mauro Maggioni 


Abstract In the presence of high-dimensional data, sampled from an unknown distribution
μ on $\mathbb{R}^D$, it is common to assume that the support of μ is approximately
a d-dimensional set, for example a d-dimensional Riemannian manifold. We introduce
a novel technique to estimate the underlying structure of the data using an algorithm
which approximates the manifold with a collection of hyperplanes. This
is done in a multiscale fashion, using a subspace clustering algorithm iteratively.
The proposed approach is data-adaptive and, by construction, provides a tree structure
for the data. The performance of the proposed method is evaluated on both
synthetic and real data, showing promising results.

Abstract In presenza di dati ad alta dimensionalità, estratti da una distribuzione μ
sconosciuta in $\mathbb{R}^D$, è usuale assumere che il supporto di μ sia approssimativamente
un set d-dimensionale, ad esempio una varietà Riemanniana d-dimensionale. In 
questo lavoro viene introdotta una nuova tecnica per stimare la struttura sottostante 
ai dati usando un algoritmo in grado di approssimare la varietà con una collezione 
di iperpiani. Il metodo viene eseguito in maniera multiscala utilizzando iterativa- 
mente un algoritmo di subspace clustering. L'approccio proposto è data-adattivo e, 
per costruzione, genera una struttura ad albero per i dati. Il metodo viene testato 
tramite l’utilizzo di dati sintetici e reali e produce risultati promettenti. 


Key words: Manifold Learning, Dictionary Learning, Multi-Resolution Analysis, 
Adaptive Approximation, Data Encoding


1 Introduction


In recent years, due to the huge amount of data that new technologies provide,
modern science has become dependent on reliable methods to deal with high-dimensional
data. Several difficulties arise when dealing with this kind of data. One
of the main problems is that when the data dimension grows, the volume of the space
grows exponentially, leading to the available data being scattered sparsely in space,
with gaps between data points also growing exponentially in the dimension. This
behavior may cause problems in statistical analysis, since the amount of data needed to give
statistical significance to the analysis usually grows exponentially with the dimension.
This problem is commonly referred to as the curse of dimensionality and it
is the main reason why high-dimensional data cannot be analyzed efficiently with
traditional methods. To tackle this issue, new techniques for dimensionality reduction
have been proposed in the literature. Most of them are based on the assumption
that there is a low-dimensional model which properly approximates the true
high-dimensional distribution from which the data was generated. One of the most used
methods which exploits a geometrical assumption on the data is principal component
analysis (PCA) [10]. Notwithstanding its popularity in applications, the assumption
of PCA that data lies close to a linear variety is often unsatisfied. In the
past decade much work has focused on replacing the linear variety assumption with
that of a nonlinear manifold $\mathcal{M}$ in $\mathbb{R}^D$ [5, 11, 2, 6, 4, 9]. In this work we focus
on multiscale techniques for manifold learning and dictionary learning, in the same
direction as Geometric Multi-Resolution Analysis [1, 9, 7].


2 The algorithm 


Let $X := \{x_i\}_{i=1}^{n}$ be a set of n samples from a probability measure μ in $\mathbb{R}^D$. We
assume that μ is supported near a set $\mathcal{M}$, e.g. a manifold, of dimension d < D. Here
by “near” we mean that the data may be corrupted by noise, or perhaps the data
is not quite distributed on a manifold, but close to one (model error) [9]. Our goal
is to learn a data-dependent dictionary that efficiently encodes data sampled from
μ. We proceed in a multiscale fashion, in order to have approximations at different
scales, with increasing accuracy as the scales get finer. We describe our procedure
in Algorithm 1. The intuition behind our method is that a Riemannian manifold
can be locally approximated by a d-dimensional plane, thus the underlying structure
of the data can be approximated by a suitable collection of planes. The main
tool used in our method is the MAPA algorithm [3], a subspace clustering algorithm
proposed to solve the plane arrangement problem using a Multiscale Singular
Value Decomposition approach [8]. Given a set of samples lying on a collection of
planes of possibly different dimensions, MAPA can reconstruct the model by
estimating the number of planes, their dimension and how they are arranged in the
ambient space. However, when MAPA is applied to data sampled from a manifold
$\mathcal{M}$, it fails to approximate the manifold properly because of curvature. Still, the
algorithm will produce a coarse approximation of $\mathcal{M}$ using up to K d-dimensional
affine planes $\{\pi_k\}_{k=1}^{K}$. The data X may be clustered into K disjoint subsets $\{X_{1,k}\}_{k=1}^{K}$
using the minimal point-plane distance, i.e. assigning x to the cluster indexed by
$\arg\min_k d(x, P_k(x))$, with $P_k$ the orthogonal projection onto the affine subspace $\pi_k$.
In rMAPA we now recurse on each of these subsets: MAPA is applied independently
to each $X_{1,k}$, generating a further, finer-scale family of subsets $X_{2,k}$, each approximated
by another set of d-dimensional planes. Figure 1 provides a pictorial representation
of this construction. This process can be iterated until a desired precision in the
estimation of $\mathcal{M}$ is achieved: in this way the process is made adaptive, meaning that
if a region of the manifold is irregular it will be approximated by several planes, while
a rather flat region will be approximated by a moderate number of planes. Examples of full
manifold reconstruction are shown in Figure 2. This approach, in comparison to a
uniform approach, leads to a reduction of the number of planes used to approximate
the manifold while maintaining the same level of precision and, of course, decreases
the running time of the algorithm. Another feature of the proposed method is that it
produces a tree structure for the data. Each scale of approximation corresponds to a
level of the tree, with each cluster represented by a tree node.
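
The two ingredients of the recursion, assignment by minimal point-plane distance and refinement of each cluster until the error is small enough, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `fit_planes` is a hypothetical stand-in for MAPA supplied by the caller, each plane is represented as a pair (origin, orthonormal basis), and the stopping rule uses the mean squared residual of a cluster.

```python
import math


def residual_norm(x, origin, basis):
    """Distance from x to the affine plane origin + span(basis).

    basis is a list of orthonormal direction vectors.
    """
    diff = [xi - oi for xi, oi in zip(x, origin)]
    for b in basis:
        coef = sum(d * bi for d, bi in zip(diff, b))
        diff = [d - coef * bi for d, bi in zip(diff, b)]
    return math.sqrt(sum(d * d for d in diff))


def assign_to_planes(points, planes):
    """Cluster points by minimal point-plane distance."""
    clusters = [[] for _ in planes]
    for x in points:
        dists = [residual_norm(x, origin, basis) for origin, basis in planes]
        clusters[dists.index(min(dists))].append(x)
    return clusters


def rmapa_sketch(points, fit_planes, tol, depth=1, max_depth=4):
    """Recursive multiscale refinement; returns a list of tree nodes.

    fit_planes(points) is a placeholder for MAPA and must return a list of
    (origin, basis) pairs. A cluster is refined further while its mean
    squared residual is at least tol and max_depth has not been reached.
    """
    planes = fit_planes(points)
    nodes = []
    for (origin, basis), cluster in zip(planes, assign_to_planes(points, planes)):
        if not cluster:
            continue
        mse = sum(residual_norm(x, origin, basis) ** 2 for x in cluster) / len(cluster)
        node = {"plane": (origin, basis), "size": len(cluster), "mse": mse, "children": []}
        if mse >= tol and depth < max_depth and len(cluster) > 1:
            node["children"] = rmapa_sketch(cluster, fit_planes, tol, depth + 1, max_depth)
        nodes.append(node)
    return nodes
```

The nested `children` lists are the tree structure described above: each level corresponds to a scale, and flat regions stop refining early while curved regions keep splitting.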


Fig. 1 The multiscale nature of Algorithm 1. In the first step, the Swissroll manifold
is roughly approximated by five planes. In the second step, the manifold is divided
into five subsets and each of them (here we show only one subset) is again approximated
by another set of five planes in order to get a finer approximation. This process
may be iterated as long as the number of points permits it, or until the desired precision
is achieved.


Algorithm 1 rMAPA


Require: data X, intrinsic dimension d, precision κ.

Ensure: $\hat{P}_{s,k}$: piecewise linear projectors, $k = 1, \dots, K_s$, for each scale $s = 1, \dots, S$.

1: Apply MAPA to X and obtain $K_1$ piecewise linear projectors $\hat{P}_{1,j}$, with $j = 1, \dots, K_1$.

2: Form $K_1$ clusters $X_{1,j}$, with $j = 1, \dots, K_1$, from X using the minimal point-plane distance.

3: Apply MAPA to the obtained clusters in order to obtain $K_s$ clusters and their piecewise linear
projections.

4: Repeat Step 3 until, at scale S, $\|X_{S,k} - \hat{P}_{S,k} X_{S,k}\|^2 < \kappa$ is obtained for each pair (S, k).


3 Experiments


In this section we show some empirical results on the performance of the proposed
algorithm on synthetic data. We draw $10^6$ points uniformly on a d-dimensional manifold.
These points are then embedded and randomly rotated in $\mathbb{R}^D$ with D = 30.
Each point is corrupted with additive Gaussian noise $N_i \sim \sigma\,\mathcal{N}(0, I_D)$. The method
is tested in different settings, varying the manifold type, a Swissroll or an
S-Manifold, and the noise level σ. In all settings we requested the precision parameter
κ to be equal to σ in order to avoid overfitting. Results in Table 1 show that the
method always achieves a precision higher than the requested κ, which means that
it acted as a denoiser in these settings. The number of planes used is rather small
and the computational time is reasonable compared to other methods. For instance,
we ran LLE [11], a standard algorithm for dimensionality reduction, on a Swissroll
with only $10^4$ points embedded in $\mathbb{R}^5$ with σ = $10^{-4}$ and it took about 2 hours to
reconstruct the manifold, while our method took an average of 11 seconds to complete
the same task, on a data matrix six hundred times bigger than the one used with
the LLE algorithm. From this experiment we also note that the number of planes
needed to estimate the S-Manifold is lower than that needed for the Swissroll to
achieve the same error. This follows from the fact that, as shown in Figure 2,
the S-Manifold has more flat regions than the Swissroll and the algorithm exploits
this feature and optimizes the number of planes needed for the approximation.


Fig. 2 Reconstruction of three different manifolds, a Swissroll, an S-Manifold and an SZ-Manifold,
using Algorithm 1. The adaptive nature of the algorithm can be appreciated in particular
at the last scale of the SZ-Manifold approximation, where the flat region
is well approximated by a single plane, while the curved regions are approximated by several planes.
This feature is more visible at the last scale because at lower scales there is not yet a
single plane which alone can approximate the flat region better than a union of planes.

Table 1 Approximation error (MSE) in different settings: a Swissroll and an S-Manifold, $n = 10^6$,
D = 30, Gaussian noise $N_i \sim \sigma\,\mathcal{N}(0, I_D)$ with σ ∈ {0.001, 0.05, 0.1} and precision κ = σ. Results
are averaged over 100 iterations on a test set of sample size n. Values in parentheses are standard
errors. For the sake of readability, MSE values are multiplied by $10^3$.


σ × 10³   Quantity     Swissroll        S-Manifold
1         MSE × 10³    0.19 (0.13)      0.34 (0.31)
          Planes (a)   25.23 (9.56)     14.12 (4.83)
          Time (b)     70.22 (24.41)    41.25 (23.05)
5         MSE × 10³    1.70 (1.10)      3.07 (0.96)
          Planes (a)   19.14 (10.96)    6.85 (2.54)
          Time (b)     48.82 (29.17)    5.28 (9.77)
10        MSE × 10³    7.30 (2.27)      7.63 (0.96)
          Planes (a)   9.38 (8.02)      5.31 (1.27)
          Time (b)     10.09 (21.39)    2.23 (3.39)


(a) Number of linear projectors used for approximation. (b) Running time in seconds.


4 Application to Real Data 


We consider the MNIST data set from http://yann.lecun.com/exdb/mnist/, which
contains grayscale images of handwritten digits, each of size 28 x 28 (i.e. in dimension
784). Sample size is 60000 for the train set and 10000 for the test set. This
data set is very interesting since its intrinsic dimension varies for different digits
and across scales, as observed in [8]. In Figure 3, we compare our method
to a projection on the first several principal components of the train set. To approximate
a digit in our multiscale approach, we orthogonally project the point (i.e. the digit)
onto the closest 5-dimensional plane. We chose d = 5 because it has
been shown in [8] and [7] that the dimension of each digit should not exceed this
number. Figure 3 shows that a good approximation can be obtained by encoding
the data using several low-dimensional planes instead of one high-dimensional
plane. As mentioned before, this feature is very important to tackle the curse
of dimensionality when doing inference.
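
The approximation step described above can be illustrated with a few lines of Python. This is only a sketch under stated assumptions, not the implementation used in the paper: each plane is assumed to be given as a centre and an orthonormal basis, the function names `project_onto_plane` and `approximate` are hypothetical, and two toy 1-dimensional planes in R^2 stand in for the 5-dimensional planes fitted to MNIST.

```python
import math

def project_onto_plane(x, center, basis):
    """Orthogonally project x onto the affine plane center + span(basis).

    Assumption of this sketch: `basis` is a list of orthonormal vectors.
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    proj = list(center)
    for u in basis:
        coeff = sum(di * ui for di, ui in zip(diff, u))
        proj = [pi + coeff * ui for pi, ui in zip(proj, u)]
    return proj

def approximate(x, planes):
    """Return (index, projection) for the plane whose projection is closest to x."""
    best = None
    for k, (center, basis) in enumerate(planes):
        proj = project_onto_plane(x, center, basis)
        dist = math.dist(x, proj)
        if best is None or dist < best[0]:
            best = (dist, k, proj)
    return best[1], best[2]

# Two hypothetical 1-dimensional planes (lines) in R^2.
planes = [
    ([0.0, 0.0], [[1.0, 0.0]]),  # the horizontal axis
    ([5.0, 0.0], [[0.0, 1.0]]),  # the vertical line through (5, 0)
]
index, approx = approximate([4.8, 3.0], planes)
```

With K planes of dimension d, a point is encoded by the index of its closest plane plus d coordinates on that plane, which is the sense in which several low-dimensional planes replace one high-dimensional one.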


5 Conclusion 


We proposed a data-adaptive multiscale algorithm for manifold learning which exploits
a subspace clustering method. The algorithm also produces a tree structure
for the data, which is a useful feature per se. Numerical experiments show that this
method handles a large number of samples embedded in high dimension very quickly
and succeeds in attaining the requested approximation error, encoding the data
with a collection of linear projectors, even when the data is corrupted with noise.
The number of projectors is limited thanks to the adaptive nature of the algorithm,
which automatically finds flat regions that can be approximated with fewer
planes. An application to the MNIST data showed how the proposed algorithm succeeds
in approximating and encoding the digits with a collection of low-dimensional
planes with an accuracy comparable to a high-dimensional encoder which uses the
first several principal components.


[Figure 3 panels, for each digit. First row: PCA 5; d = 5, K = 3; d = 5, K = 5; d = 5, K = 11; d = 5, K = 33; real data. Second row: PCA 10; PCA 20; PCA 30; PCA 40; PCA 50; PCA 784.]

Fig. 3 Approximation of two digits, a nine and a six, from the MNIST test set. The first row shows the
reconstruction obtained with the proposed multiscale approximation, with plane dimension d = 5 and
different numbers of planes K. The second row shows the projection of the digits on the
first several principal components of the train set. Note that PCA 5 is equivalent to the proposed
algorithm with d = 5 and K = 1.
Using scanner and CPI data to estimate Italian
sub-national PPPs


The use of scanner data and of data from the traditional
consumer price survey for estimating PPPs
at the sub-national level in Italy


Tiziana Laureti, Carlo Ferrante, Barbara Dramis 


Abstract 

The recent availability of high-frequency electronic-point-of-sale “scanner data”
together with the traditional Consumer Price Index (CPI) data has the potential to 
significantly change how to compile spatial price indexes. Indeed, one of the main 
issues when constructing sub-national Purchasing Power Parities (PPPs) is to obtain 
price data from multiple sources and outlets, which are representative of local 
consumption patterns and comparable on the basis of a set of price determining 
characteristics. 

Within this framework, the aim of this paper is to suggest a new stochastic 
methodological approach to index numbers based on the Country-Product-Dummy 
(CPD) method for computing sub-national PPPs in Italy. This approach enables us to 
use both scanner data and traditional CPI data coming from large-scale retail trade. 
By using millions of price records concerning food and grocery products collected in 
supermarkets and hypermarkets of the most important chains of modern distribution 
located in the 20 regional chief towns in Italy, we provide estimates of Basic 
Heading (BH) PPPs for the year 2015. By using the CPD-based stochastic approach
we are able to obtain reliability measures for sub-national PPPs as well as to address
methodological issues concerning the choice of the aggregation methods above the
BH level.


Summary

The recent availability of “scanner data” enriches the information
coming from the traditional survey used to construct Consumer Price Indexes
and may significantly change the way spatial price indexes
are compiled.

Indeed, estimating Purchasing Power Parities (PPPs) requires
consumer price data for products and services that are comparable across
space and representative of local consumption.

The aim of this paper is to suggest a stochastic methodological approach
based on the Country-Product-Dummy method for computing PPPs at
sub-national level in Italy. This method makes it possible to use
scanner data and traditional data simultaneously with reference to
large-scale retail trade (Grande Distribuzione Organizzata, GDO).

Using millions of observations on prices and quantities of food products
and of personal care and household cleaning products, collected in the
supermarkets and hypermarkets of the most important GDO chains located in the 20
regional chief towns, the paper presents PPP estimates by product
sub-class for the year 2015. The method makes it possible to construct
confidence intervals for these indexes, as well as to examine different methods for aggregating the PPPs
of the product sub-classes.


Key words: scanner data, sub-national Purchasing Power Parities, Country Product 
Dummy models 


1 Introduction 


Spatial price indexes that measure the differences in price levels across regions 
within a country are essential for comparing real income, standards of living and 
consumer expenditure patterns. In countries characterized by large territorial 
differences in consumer preferences as well as in quality of products and household 
characteristics, the calculation of sub-national Purchasing Power Parities (PPPs) 
acquires great importance. 

The Italian National Institute of Statistics (Istat) is one of the few National
Statistical Offices (NSOs) that has carried out official experimental sub-national PPP
computations by using price data from Consumer Price Indexes (CPIs) and ad hoc
surveys and focusing on comparing consumer prices across the 20 Italian regions.
Significant price differences were found in 2010, which encouraged Istat to confirm
the project for producing sub-national PPPs on a regular basis (Biggeri et al., 2016).
The recent availability of high-frequency “scanner data” in addition to other sources
of data enables us to approach the sub-national PPP issue in a new way,
thus increasing cost efficiency and reducing response burden. Indeed, the way in
which CPI data are collected often complicates the estimation of spatial price
differences as products collected for CPI may not be comparable or representative
across different areas, especially when the areas within a country differ in terms of
climate, tastes and preferences. Moreover, ad hoc surveys are generally very
expensive for the NSOs and do not provide information on consumption expenditure
as in the case of CPI data. Within the European Multipurpose Price Statistics project,
Istat has been exploring the possibility of using scanner data for computing official
CPIs since 2014 and recently for compiling PPPs.

However, scanner datasets provide both opportunities and challenges for price 
statisticians since they must deal with huge amounts of highly detailed data on 
consumer purchases with high variability of products sold among cities. 

Therefore, it is now essential to determine how best to use scanner data and
how to combine them with data from other sources in order to construct
sub-national PPPs. There is growing interest worldwide in using these data for
compiling official price statistics, but as yet research has mainly focused on
using scanner data for compiling CPIs in order to measure inflation rates.

The aim of this paper is to contribute to the advancement of spatial price index 
literature by exploring this new source of price data together with CPI data in order 
to compute Italian sub-national PPPs at Basic Heading (BH) level. We propose 
using a stochastic approach to index numbers based on the Country-Product-Dummy 
(CPD) method since it enables us to use both scanner data and traditional CPI data 
obtained from the large-scale retail trade and obtain reliability measures for sub- 
national PPPs. Interesting results have been obtained from numerous experiments 
using both sources of data even if, due to lack of space, we will only focus on a few 
of them in order to illustrate the potential of the proposed methodology and highlight 
the possible informative results. 

The paper is structured as follows. Features of Italian scanner and CPI data and 
the results of the analyses carried out are presented and discussed in Section 2. The 
methodology used is described in Section 3 while in Section 4 some of the results 
obtained using various CPD models are presented and discussed for a set of BHs. 


2 Scanner and CPI data analyses 


Scanner data obtained from electronic points of sale benefit from an impressive
coverage of transactions along with the availability of information on sales, prices,
quantities sold and quality characteristics of products sold (brand, size and type of
outlet) provided at the level of the barcode or, more precisely, the GTIN (Global
Trade Item Number, formerly EAN code).
Currently, scanner data predominantly replace price collection in supermarkets,
especially for food, beverages and personal and home care products.

However, there are also several potential drawbacks in using scanner data as they
are characterized by a high attrition rate of goods and volatility of the prices and
quantities due to sales. Indeed, this new source of data is able to capture frequent,
and often large, shifts in quantities purchased in response to price changes.
Moreover, using highly detailed data on consumer purchases implies that when 
computing price indexes it is essential to deal with the issue of aggregation of 
individual items from both a theoretical and a practical perspective. 

Over the last decade an increasing number of studies have been carried out with 
the aim of analysing the impact of different aggregation methods on inflation 
estimates by using scanner data as various NSOs are interested in using these data in 
their official price statistics (see for example Ivancic et al, 2011). 

However, to the authors’ knowledge as yet few studies have used scanner data 
and carried out experiments on aggregation issues when comparing consumer prices 
across space (Heravi, et al, 2003; Laureti and Polidoro, 2016). 

Although the time dimension of aggregation should not be difficult to deal with when
comparing price levels across areas within a country, much attention should be paid 
when aggregating transaction price and quantity data for constructing spatial price 
indexes as the choices made will reflect different implicit assumptions regarding 
consumer behaviour. By referring to a specified set of geographical areas within a 
country, i.e. regions or cities, transactions can be aggregated over different items, 
stores and time periods. 

In Italy, scanner data are collected weekly (approximately 1 million records) and,
after a process of data cleaning and outlier trimming, they are used to compute the unit
value price per item code, calculated as the total turnover for that item code divided
by the total quantity sold over the week.
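
As a minimal illustration of the unit value computation just described, the following Python sketch aggregates one week of records by item code. The record layout `(item_code, turnover, quantity)`, the function name `unit_values` and the item codes and amounts are assumptions made for the example, not Istat's actual data structure.

```python
from collections import defaultdict

def unit_values(records):
    """Weekly unit value per item code: total turnover / total quantity sold.

    `records` is an iterable of (item_code, turnover, quantity) tuples,
    a hypothetical layout chosen for this sketch.
    """
    turnover = defaultdict(float)
    quantity = defaultdict(float)
    for code, t, q in records:
        turnover[code] += t
        quantity[code] += q
    # Item codes with no quantity sold are skipped to avoid dividing by zero.
    return {code: turnover[code] / quantity[code]
            for code in turnover if quantity[code] > 0}

# Made-up records for one week: two outlets sell the first item, one the second.
weekly_records = [
    ("8001234567890", 30.0, 20),
    ("8001234567890", 12.0, 10),
    ("8009876543210", 45.0, 15),
]
prices = unit_values(weekly_records)
```

Because turnover and quantity are summed before dividing, the unit value is a quantity-weighted average of the transaction prices rather than a simple mean of outlet prices.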

In order to understand how best to aggregate the detailed information contained in 
the Italian scanner data for constructing sub-national PPPs, several analyses were 
carried out. More specifically, by referring to each Italian regional chief town, we 
carried out ANOVA and t tests on a sample of items in order to verify if the price of 
the same item could reflect auxiliary services provided by the seller. Indeed, within 
each city, the same item is found in different supermarket chains and in different 
stores which belong to the same retail chain. Results show significant differences in 
prices of the same items thus suggesting product differentiation which is embodied in 
the range or quality of services offered by different retailers, both across chain and 
across stores within the same chain. Moreover, by calculating skewness measures 
and Gini coefficients, we found that the distribution of expenditures within a product 
category is usually highly skewed and a relatively small number of items account for 
the majority of expenditures (see Table 1). Therefore, with respect to item
groupings, we decided to use the finest classification of items that is available within
the BH, i.e. the product code, which is identical across the Italian territory. As
regards the time dimension, since the International Comparison Program (ICP) relies
on a national average price for each item below the BH, we decided to use annual
regional average prices, which are obtained by aggregating the weekly price of each
EAN code by considering outlet type (hypermarket or supermarket of a specific
chain) and modern distribution chains for the 20 regional chief towns and by using
turnover and quantity as weights. In this way we can mitigate the effects of the large
fluctuations in quantities purchased in response to price discounts, which still emerge
when using monthly average prices. The dataset consists of 2,799,320 annual price quotes
from the 20 regional chief towns concerning the six most important modern
distribution chains.
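
The concentration of expenditure within a product category, measured above through Gini coefficients, can be illustrated as follows. The sketch uses one common sample formula for the Gini coefficient based on ranked values; whether the authors used exactly this variant is not stated in the text, and the function name `gini` and the expenditure figures are made up.

```python
def gini(values):
    """Gini coefficient of non-negative values (e.g. expenditure per item).

    Uses G = 2 * sum(rank_i * x_(i)) / (n * sum(x)) - (n + 1) / n,
    where x_(i) are the values sorted in increasing order.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

# Made-up expenditures for four items in a product category.
equal = gini([10, 10, 10, 10])        # expenditure evenly spread
concentrated = gini([0, 0, 0, 100])   # one item accounts for everything
```

With this formula an even distribution gives 0 and full concentration on one of n items gives (n - 1) / n, so values such as those in Table 1 indicate that a small number of items dominates expenditure.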

Not all the results obtained can be reported due to lack of space; we have therefore
selected a few BHs, whose Gini coefficients and number of different items sold
in each regional chief town are shown in Table 1. Significant differences in Gini
coefficients and number of products can be observed both between the various BHs
within the same regional chief town and across cities within the same BH.


Table 1: Descriptive statistics by regional chief towns and BHs


Regional chief     Mineral or spring water   Personal care products   Household cleaning and
towns                                                                 maintenance products
                   N.Items   Gini            N.Items   Gini           N.Items   Gini
North
Aosta              31        0.628           1180      0.539          709       0.468
Torino             254       0.757           2918      0.661          459       0.604
Genova             83        0.716           1077      0.672          470       0.603
Milano             258       0.797           2930      0.76           477       0.746
Trento             79        0.542           789       0.587          413       0.508
Venezia            97        0.724           2234      0.638          216       0.573
Trieste            05        0.593           890       0.506          588       0.518
Bologna            204       0.771           2458      0.67           189       0.652
Centre
Firenze            47        0.827           387       0.747          669       0.721
Ancona             89        0.727           814       0.648          077       0.578
Perugia            88        0.784           412       0.731          768       0.682
Roma               270       0.752           2428      0.692          223       0.623
South and Islands
L'Aquila           49        0.698           984       0.594          564       0.467
Campobasso         17        0.666           703       0.587          456       0.455
Napoli             75        0.709           589       0.678          877       0.622
Potenza            89        0.693           470       0.554          307       0.496
Bari               51        0.716           611       0.673          787       0.595
Catanzaro          66        0.607           602       0.579          335       0.559
Palermo            22        0.678           390       0.639          758       0.594
Cagliari           36        0.738           795       0.611          887       0.565


The scanner data set excludes perishables and seasonal products such as vegetables, 
fruit and meat since these products are sold at price per quantity and are not pre- 
packaged with EAN codes. Therefore, we integrated the scanner data with prices for 
fruit and vegetables which are traditionally collected for CPI production in modern 
distribution. After data quality controls and preliminary analyses of the basket, the 
CPI dataset includes annual average prices for 151 vegetable products collected in 
the 20 regional chief towns. For both scanner and CPI data, the items for which price 
data were collected in a single regional chief town only were eliminated in order to 
ensure that the incomplete price tableau is connected and therefore the CPD method 
is feasible. 
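
The filtering rule and the connectedness requirement mentioned above can be sketched as follows. This is an illustrative check, not the procedure actually used by the authors: quotes are assumed to be `(item, town)` pairs, the function name `filter_and_check` is hypothetical, and connectedness is tested with a simple union-find over towns that share at least one item.

```python
from collections import defaultdict

def filter_and_check(quotes):
    """quotes: list of (item, town) pairs for which a price is available.

    Returns the items kept (priced in at least two towns) and whether the
    remaining price tableau links all towns into a single connected block.
    """
    towns_by_item = defaultdict(set)
    for item, town in quotes:
        towns_by_item[item].add(town)
    kept = {item: towns for item, towns in towns_by_item.items()
            if len(towns) >= 2}

    parent = {town: town for _, town in quotes}

    def find(town):
        while parent[town] != town:
            parent[town] = parent[parent[town]]  # path halving
            town = parent[town]
        return town

    # Towns pricing the same item belong to the same block.
    for towns in kept.values():
        first, *others = sorted(towns)
        for other in others:
            parent[find(other)] = find(first)

    roots = {find(town) for town in parent}
    return sorted(kept), len(roots) == 1

# Made-up example: "local bread" is priced in Napoli only and is dropped.
quotes = [("pasta", "Roma"), ("pasta", "Milano"),
          ("rice", "Milano"), ("rice", "Napoli"),
          ("local bread", "Napoli")]
items, connected = filter_and_check(quotes)
```

If the tableau were not connected, the area effects of towns in different blocks could not be compared, because no chain of common items would link them.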
3 Methodology and Empirical strategy


PPP compilation is undertaken at two levels, viz., at BH level, which is defined as a 
group of similar well-defined goods or services, and at a more aggregated level. 
Price data are usually aggregated at BH level without weights to produce PPPs for 
various BHs which are then aggregated to obtain PPPs for higher level aggregates. 

This paper focuses on the first stage of aggregation, thus obtaining estimates of 
sub-national PPPs at BH level, because BHs are the foundations of overall 
comparison and it is essential to obtain reliable PPP estimates (Biggeri et al., 2016).

From a methodological point of view, we refer to the relatively new strand of the 
stochastic approach to spatial price indexes with its roots in hedonic and CPD 
regression-based methodology, which is also used by the ICP at the World Bank. 

Several authors have demonstrated that, thanks to its econometric nature, the 
CPD method could be extended and generalized in order to provide a comprehensive 
framework for carrying out international and intra-national comparisons using price 
data from various sources and also allows for the computation of standard errors for 
the various spatial price index methods which can be derived from variants of the 
CPD model (Rao and Hajargasht, 2016). 

Due to the type and characteristics of the scanner and CPI data as well as the 
results of our preliminary analyses it is advisable to use hedonic CPD models 
through which information on the type of outlet and retail chain may be considered 
when constructing sub-national PPPs. Moreover, in order to account for the 
economic importance of each item in its market, which is essential in index number 
literature as demonstrated also by our analyses, we estimated weighted hedonic CPD 
models using both expenditure share and quantity as weights. 

Let us assume that we are attempting to make a spatial comparison of prices
between R areas (i.e. regional chief towns) at BH level and that $p_{nkr}$ denotes the annual
price of item n in outlet k of area r (n = 1, 2, ..., N; r = 1, 2, ..., R; k = 1, ..., K).
Assuming that $Z_1, Z_2, \ldots, Z_J$ represent the set of quality characteristics associated
with each item, the hedonic CPD model estimates the following regression
equation separately for each BH:

$$\ln p_{nkr} = \sum_{r=1}^{R} a_r D_r + \sum_{n=1}^{N} b_n D^*_n + \sum_{j=1}^{J} c_j Z_j + v_{nkr} \qquad (1)$$

where $D_r$ are dummies for the areas, $D^*_n$ are dummies for the type of product, $a_r$, $b_n$
and $c_j$ are, respectively, the differences in (fixed) effects associated with the areas, type
of product and quality characteristics with respect to a specific item; $v_{nkr}$ are random
disturbance terms which are independently and identically (normally) distributed
with zero mean and variance $\sigma^2$.
With reference to the scanner data we used the hedonic weighted CPD (WCPD), i.e. model (1)
estimated by weighted least squares,
where $w_{nkr}$ are expenditure and quantity weights reflecting the relative importance of
different items. The parameter $a_r$ is interpreted as the average level of prices (over
all items in the BH) in area r relative to other areas, while $b_n$ is to be interpreted as
the average (over all areas) premium that item n is worth relative to an average item
in this BH. If $a_r$ is expressed relative to a reference area (in our case Rome = 100),
then the PPP for area r is given by $PPP_r = e^{a_r}$, where $a_r$ is the difference between the
coefficient for area r and that corresponding to the reference area (Rome).
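
To make the weighted CPD idea concrete, the sketch below fits a simplified model with area and item effects only (the quality characteristics Z are omitted) and converts the area effects into PPPs with the reference area set to 100. It is a toy illustration under stated assumptions, not the authors' estimation code: the weighted least squares problem is solved by alternating weighted averages rather than by a regression routine, and the function name `weighted_cpd`, the towns, items, effects and weights are all made up.

```python
import math
from collections import defaultdict

def weighted_cpd(records, ref_area, n_iter=200):
    """records: list of (area, item, price, weight) tuples.

    Minimises sum of weight * (ln price - a[area] - b[item])**2 by updating
    the item effects and the area effects in turn, then returns the PPPs
    relative to ref_area (reference = 100). ref_area must appear in records.
    """
    a = defaultdict(float)
    b = defaultdict(float)
    for _ in range(n_iter):
        # Item effects: weighted mean of log prices net of area effects.
        num, den = defaultdict(float), defaultdict(float)
        for area, item, price, w in records:
            num[item] += w * (math.log(price) - a[area])
            den[item] += w
        for item in num:
            b[item] = num[item] / den[item]
        # Area effects: weighted mean of log prices net of item effects.
        num, den = defaultdict(float), defaultdict(float)
        for area, item, price, w in records:
            num[area] += w * (math.log(price) - b[item])
            den[area] += w
        for area in num:
            a[area] = num[area] / den[area]
    return {area: 100.0 * math.exp(a[area] - a[ref_area]) for area in list(a)}

# Made-up log-price effects; prices follow the model exactly, one cell is missing.
true_a = {"Roma": 0.0, "Milano": 0.05, "Napoli": -0.04}
true_b = {"water": -1.0, "soap": 1.0, "pasta": 0.0}
weights = {"water": 3.0, "soap": 1.0, "pasta": 2.0}
records = [(area, item, math.exp(true_a[area] + true_b[item]), weights[item])
           for area in true_a for item in true_b
           if (area, item) != ("Napoli", "pasta")]
ppp = weighted_cpd(records, ref_area="Roma")
```

Because the toy prices are generated without noise, the recovered PPPs coincide with 100 times the exponential of the area effects; with real data the weights determine how much each item influences the estimated price level of an area.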


4 Results 


We ran hedonic CPD and hedonic weighted CPD for all available BHs in scanner 
data using expenditure and quantity as weights. Sub-national PPP results are only 
reported for the “Personal care products” BH (Table 2). Moreover, Table 2 shows 
PPP results for the “Fresh and chilled vegetables” BH based on CPI data referring to 
the six most important modern distribution chains (first two columns of Table 2). 
Significant differences can be observed between the results obtained from the 
hedonic CPD and WCPD yet similar results can be observed for the two expenditure 
and quantity WCPD models. These results appear to be coherent with our 
expectations and the territorial characteristics of the Italian macro areas. 


Table 2 CPD estimation results: PPP estimates for regional chief towns by BH, Rome = 100


             Fresh and chilled
             vegetables        Personal care products
             Hedonic CPD       Hedonic CPD     Hedonic WCPD                Hedonic WCPD
                                               (weights = expenditures)    (weights = quantity)
             PPPs  Sig.        PPPs  Sig.      PPPs  Sig.                  PPPs  Sig.
North 
Aosta 100.44 01.43 ma 02:66, « + FEE 02.02 *** 
Torino 101.50 96.37, “met 97.68 *##* OTIA ERRE 
Genova 108.20 ** 03.96; RFF 00.31 00.37 
Milano 92.45 4% 98.01 *** 00.15 00.12 
Trento 94.14 *** 0326: p #4 02.39 *##* 02.38 *##* 
Venezia 108.26 *** 95.14 ma OTRS REF 97.70. *** 
Trieste 99.42 02.16 *** 02:92; Et 02.72 *#* 
Bologna 103.47 99.63 * 00.61 *** 00.98 *** 
Centre 
Firenze 100.98 89.72 AIR 85.22 *#* 83.54 *E 
Ancona 108.22 *** 00.04 00.16 00.49 * 
Perugia 101.57 00.51 ** 97.74, *** 97:24; ERE 
South and Islands 
L'Aquila 86:55) = ERE OLAF “RFE 00.26 99.48 
Campobasso 84.63  *** 00.78 ** 9921 * 98.49 ** 
Napoli ows)... (See 96.01 *** 95,80, TE 94.24  *** 
Potenza 88.33 *##* 97.08 #6 GARI *##* 94.39 we 
Bari 97.47 94.62 *** 95:15; FEF 94.59 IH 
Catanzaro 103.67 * 9815: -eaa 9770,» EF 97.060 AR 
Palermo 103.70 08575 RE 97.79 *** 96.83: wek 
Cagliari 102.87 98:57 u 100.15 100.23 
Obs. 3,327 66,604 66,604 66,604 
Root MSE 0.17573 0.1170 0.1017 0.1017 


Note: * 10%, ** 5%, *** 1% 
Table 3 WCPD estimation results: PPP ranking of regional chief towns by BH (1 = highest prices,
20 = lowest prices)

Aosta North 2 2 12 3 9 2 3 4 6 
Torino North 8 5 17 7 10 11 6 9 2 
Genova North 1 1 4 1 15 5 20 1 7 
Milano North 4 3 15 8 11 4 5 8 0 
Trento North 3 4 9 5 2 9 7 4 7 
Venezia North 6 6 19 9 6 15 1 7 2 
Trieste North 2 9 3 Il 3 1 4 2 6 
Bologna North 3 1 13 7 4 6 T 7 0 
North 7 6 12 9 8 7 10 10 9 
Firenze Centre 20 20 20 20 20 20 2 20 9 
Ancona Centre 0 5 14 8 5 3 8 5 1 
Perugia Centre 8 8 10 6 F 13 9 0 4 
Roma Centre 6 7 7 9 2 10 8 8 8 
Centre 14 15 13 16 14 12 12 11 3 
L'Aquila South &Islands 9 0 1 6 3 8 3 6 8 
Campobasso South &Islands 1 2 6 4 1 14 0 1 9 
Napoli South &Islands 5 4 8 5 7 12 5 2 1 
Potenza South &Islands 7 3 2 3 8 18 9 3 2 
Bari South &Islands 4 7 16 2 8 19 6 6 6 
Catanzaro South &Islands 9 9 5 2 9 17 1 5 3 
Palermo South &Islands 7 8 11 0 6 16 2 9 0 
Cagliari South &Islands 5 6 18 4 4 i 4 3 0 
South and Islands 12 12 8 10 12 14 10 11 1 


Table 3 reports the position in the ranking of the PPP values for the regional chief
towns compiled using the expenditure-share WCPD for the BHs belonging to the “Bread
and cereals” class. Significant variability is found in the position of the regional chief
towns among BHs within the same class of products, which may reflect different
consumer behaviors and characteristics of modern retail chains in the Italian regions.
Our results provide valuable insight on how to aggregate PPPs above the BH level.
Graphical approximation of Best Linear
Unbiased Estimators for Extreme Value
Distribution Parameters


Graphical approximation of the BLUEs of the
parameters of the extreme value distributions.


Antonio Lepore 


Abstract Graphical estimation methods play a central role in today’s software because
they may allow for a more straightforward analysis of the data and interpretation
of results also by non-statisticians. In this paper, the best unbiased graphical estimators
of distribution parameters, which have recently appeared in the literature
for location-scale distributions, are conveniently approximated for the special case
of the extreme value distributions for minima and maxima. The mean square deviation
and bias of the resulting parameter estimators are compared to competing
ones through proper pivotal indices via Monte Carlo simulation. The proposed approximation
involves the first two moments of order statistics from the standard extreme value
distributions and is shown to produce adequate results for them as well.
Abstract Graphical estimation methods play a central role in modern software
tools, since they facilitate data analysis and the interpretation of results
also in non-statistical contexts. In this paper, the graphical BLUE estimators recently
proposed in the literature for the location-scale family of distributions
are approximated in the special case of the extreme value distributions and
are compared, by means of Monte Carlo simulation, with the corresponding
alternatives proposed in the literature through suitable non-parametric indices.
Moreover, the proposed solution provides a satisfactory approximation of the
first two moments of the order statistics for the standard extreme value
distributions.


Key words: graphics and data visualization, linear unbiased estimators, location- 
scale distributions, extreme value distribution, probability plot. 
1 Introduction


In many fields of application, practitioners are used to exploiting software tools that adopt
graphical techniques to visualize data and check the fit provided by the chosen
model. Even if a variety of effective analytical methods is available, graphical techniques
give deeper insight and visual understanding of statistical information. This
is commonly achieved through probability plots, which report ordered observations
of a random variable (i.e., experimental data) against the corresponding estimates
$\hat{F}_i$ of the parent cumulative distribution function (cdf) (i.e., the plotting position) on
properly scaled axes in a linear fashion. The more the points lie on a straight line,
the more suitable the chosen parent distribution. As is known, the latter is usually 
required to belong or to be related to the location-scale family. This assumption al- 
lows probability plots to estimate parent distribution parameters through the slope 
and the intercept of the line of best fit [11]. However, the choice of the regressand 
(i.e., response variable) for the distribution fitting methods and their corresponding 
relative accuracy are not always clear [10] and the dispute of determining a unique 
plotting position approach has given rise to recent contributions and a wide con- 
troversial discussion [1, 2, 4, 5, 6, 7, 11, 14, 15, 16, 17, 18]. In particular, Pirouzi 
Fard and Holmquist [18] consider simple approximations of variances and covari- 
ances for order statistics from the standard extreme value (EV) distribution, whereas 
Pirouzi Fard [17] provides a comparison between the ordinary least-squares (OLS) 
and the generalized least-squares (GLS) distribution fitting methods when the data 
set arises from the standard EV distribution for minima. Cook and Harris [2] find out 
in the case of the EV distribution for maxima that the classical Gringorten estima- 
tor [8] of the order statistic mean gives asymptotic values for infinite sample sizes, 
whereas they are most often improperly used for small sample sizes. Fuglem et al. 
[7] support previous work by Cunnane [3] and state that plotting position methods 
dependent on the anticipated parent distribution should be used. However, Makko- 
nen et al. [15, 16] still support the classical distribution-free approach [9]. In this 
paper, the graphical best linear unbiased estimators (BLUEs) [6], which have recently
appeared in the literature, are elaborated for the EV distributions through a
convenient approximation of the first two moments of standard order statistics and
compared to the most popular and effective ones.
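
As a simple illustration of how a probability plot yields parameter estimates, the sketch below fits a straight line by ordinary least squares to the ordered observations plotted against the reduced variate of the EV distribution for maxima. It uses the classical plotting position i/(N + 1) purely as an example and is not the BLUE approximation elaborated in this paper; the function name `gumbel_plot_fit` and the data are made up.

```python
import math

def gumbel_plot_fit(sample):
    """Estimate (a, b) of the EV distribution for maxima from a probability plot.

    Ordered observations (regressand) are regressed by OLS on the reduced
    variate y_i = -ln(-ln(p_i)), with p_i = i / (N + 1).
    The intercept estimates the location a, the slope the scale b.
    """
    xs = sorted(sample)
    n = len(xs)
    ys = [-math.log(-math.log(i / (n + 1))) for i in range(1, n + 1)]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((y - y_mean) * (x - x_mean) for x, y in zip(xs, ys))
    syy = sum((y - y_mean) ** 2 for y in ys)
    b = sxy / syy
    a = x_mean - b * y_mean
    return a, b

# Artificial sample lying exactly on the line x = 2 + 3 * y.
n = 10
reduced = [-math.log(-math.log(i / (n + 1))) for i in range(1, n + 1)]
sample = [2.0 + 3.0 * y for y in reduced]
a_hat, b_hat = gumbel_plot_fit(sample)
```

Changing the plotting position or replacing OLS with GLS changes the reduced variates or the weighting of the points, which is the source of the differences among the approaches compared in the paper.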


2 Approximation of the BLUEs of Extreme Value Distribution 
Parameters via probability plots 


The EV cdfs for minima (referred to as the extreme value distribution in [17, 18]) and
maxima (referred to as Gumbel in [2, 10, 11]) are, respectively, as follows

$$F_m(x; a, b) = 1 - e^{-e^{\frac{x-a}{b}}}, \qquad F_M(x; a, b) = e^{-e^{-\frac{x-a}{b}}}, \qquad b > 0. \qquad (1)$$
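
For readers who want to experiment with the two distributions, the following sketch codes the cdfs in (1) and their inverse functions in plain Python. The function names are hypothetical, and the parametrisation (location a, scale b > 0) follows the formulas above.

```python
import math

def cdf_min(x, a=0.0, b=1.0):
    """EV cdf for minima: 1 - exp(-exp((x - a) / b))."""
    return 1.0 - math.exp(-math.exp((x - a) / b))

def cdf_max(x, a=0.0, b=1.0):
    """EV cdf for maxima (Gumbel): exp(-exp(-(x - a) / b))."""
    return math.exp(-math.exp(-(x - a) / b))

def quantile_min(p, a=0.0, b=1.0):
    """Inverse of cdf_min for 0 < p < 1."""
    return a + b * math.log(-math.log(1.0 - p))

def quantile_max(p, a=0.0, b=1.0):
    """Inverse of cdf_max for 0 < p < 1."""
    return a - b * math.log(-math.log(p))

# Round trip through the inverse, and the mirror relation between the two
# standard distributions: if X follows the maxima form, -X follows the minima form.
round_trip = cdf_min(quantile_min(0.3, a=1.0, b=2.0), a=1.0, b=2.0)
mirror_gap = cdf_min(0.7) - (1.0 - cdf_max(-0.7))
```

The mirror relation is why results for minima and maxima can be derived from one another by changing the sign of the observations.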
As is known, the standard EV cdfs can be obtained by setting a = 0 and b = 1 and
have inverse functions which are infinitely differentiable. The performance of graphical
approaches for EV distributions is influenced by the choice of the plotting position
formula, the distribution fitting method, as well as the covariance matrix (or its
approximation), especially if the sample size is small [10]. In general, the use of
the ordered observations of a sample of size N, $x_{(1)}, \ldots, x_{(i)}, \ldots, x_{(N)}$, as regressand
and the means of the standard order statistics, $\mu_{(1)}, \ldots, \mu_{(i)}, \ldots, \mu_{(N)}$, as regressors
achieves the best results and is mandatory for GLS estimation, which explicitly
requires the specification of the covariance, $\sigma_{(i,j)}$, between the i-th and j-th order
statistics ($1 \le i \le j \le N$) and leads to BLUEs of distribution parameters. In this
paper the approximations suggested in [6] for $\mu_{(i)}$ and $\sigma_{(i,j)}$

$$\tilde{\mu}_{(i)} = G^{-1}(p_i) + \frac{p_i(1-p_i)}{2(N+2)}\, G^{-1(2)}(p_i) + \frac{p_i(1-p_i)}{(N+2)^2} \left[ \frac{1}{3}(1-2p_i)\, G^{-1(3)}(p_i) + \frac{1}{8}\, p_i(1-p_i)\, G^{-1(4)}(p_i) \right] \qquad (2)$$

$$\begin{aligned}
\tilde{\sigma}_{(i,j)} = {} & \frac{p_i(1-p_j)}{N+2}\, G^{-1(1)}(p_i)\, G^{-1(1)}(p_j) + \frac{p_i(1-p_j)}{(N+2)^2} \Big[ (1-2p_i)\, G^{-1(2)}(p_i)\, G^{-1(1)}(p_j) \\
& + (1-2p_j)\, G^{-1(2)}(p_j)\, G^{-1(1)}(p_i) + \tfrac{1}{2}\, p_i(1-p_j)\, G^{-1(2)}(p_i)\, G^{-1(2)}(p_j) \\
& + \tfrac{1}{2}\, p_i(1-p_i)\, G^{-1(3)}(p_i)\, G^{-1(1)}(p_j) + \tfrac{1}{2}\, p_j(1-p_j)\, G^{-1(1)}(p_i)\, G^{-1(3)}(p_j) \Big]
\end{aligned} \qquad (3)$$

are conveniently elaborated for the EV distribution for minima (resp. maxima),
where $G(x) = F_m(x; 0, 1)$ (resp. $G(x) = F_M(x; 0, 1)$), $p_i = i/(N+1)$, and $G^{-1(k)}(x)$
is the k-th derivative of the inverse function $G^{-1}(x)$. However, simple closed forms
are available for $\mu_{(1)}$ and $\sigma_{(1,1)}$ in the case of the EV distribution for minima

$$\mu_{(1)} = -\gamma - \ln N, \qquad \sigma_{(1,1)} = \pi^2/6 \qquad (4)$$

and for $\mu_{(N)}$ and $\sigma_{(N,N)}$ in the case of the EV distribution for maxima

$$\mu_{(N)} = \gamma + \ln N, \qquad \sigma_{(N,N)} = \pi^2/6 \qquad (5)$$

where $\gamma$ is Euler’s constant. Therefore, expressions (4) and (5) can more
appropriately be used for these order statistics in place of the general approximation formulas (2)
and (3). Then, the GLS regression of the sample observations on $\tilde{\mu}_{(i)}$ through the
covariance approximation $\tilde{\sigma}_{(i,j)}$ leads to graphical estimators for a and b that are
not unbiased because of the approximations. Hence, based on the results drawn in
[10], it can be of interest to compare the latter approach with the most effective ones
among those mentioned in the introduction and summarized in the first two rows
of Table 1, namely Pirouzi Fard (PF) [17] and Hong and Li (HL) [11]. In general,
each approximation $\tilde{\mu}_{(i)}$ is associated with a plotting position $\hat{F}_i = G(\tilde{\mu}_{(i)})$ and
vice versa. The last two rows of Table 1 report the Cook and Harris (CH) [2] and
Table 1 Summary of the analysed probability plots (1 < i < j < N). The correction factors Ywx
(k= 1,...,5) are defined as in [11]. 


Fi Si.) 
pet i n?’ /6 i=j=1 
l-e N i= P Asf -1 
PF ; (i-0.469)(N+0.831—i)'(/+0.073) g 
pee elsewhere 1 Ear) in( NEORLF) elsewhere 
RO weosse ) pos 
i ev EN 7/6 JEN 
HL $ _(-0.37+0.232/V™) sisewhere SA iw elsewhere 
(N+0.144+0.232/VN) <" (42-12) G- 8) (in E) in (er) 
CH (i-0.439+0.466/In(N)) 
i 
GU NFI = 


the classical Gumbel (GU) [9] plotting positions that rely instead on the use of the 
OLS method. Note that PF only applies to the EV distribution for minima, whereas 
HL and CH only apply to that for maxima. 
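
The closed forms (4) for the EV distribution for minima can be checked numerically. The following is only a minimal sketch: it assumes the standard EV distribution for minima has cdf 1 - exp(-exp(x)), and the function name `simulate_smallest`, the number of replications and the seed are illustrative choices, not taken from the paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def simulate_smallest(N, reps, seed=0):
    """Monte Carlo mean and variance of the smallest of N standard EV (minima) variates."""
    rng = random.Random(seed)
    smallest = []
    for _ in range(reps):
        # ln(E) with E ~ Exp(1) has cdf 1 - exp(-exp(x)); the guard avoids log(0).
        sample = [math.log(max(rng.expovariate(1.0), 1e-300)) for _ in range(N)]
        smallest.append(min(sample))
    mean = sum(smallest) / reps
    var = sum((s - mean) ** 2 for s in smallest) / (reps - 1)
    return mean, var


if __name__ == "__main__":
    N = 5
    mean, var = simulate_smallest(N, reps=20000, seed=1)
    # Simulated values next to the closed forms of expression (4).
    print(mean, -EULER_GAMMA - math.log(N))
    print(var, math.pi ** 2 / 6)
```

By symmetry, changing the sign of every variate gives the corresponding check of (5) for maxima.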


3 Simulation Study and Results 


A simulation study is carried out by drawing M = 10° pseudo-random samples from 
the EV distributions for minima and maxima at sample sizes N = 5 and N = 30 to 
compare 


(i) the goodness of the approximations used for $\mu_{(i)}$ and (when applicable) for $\sigma_{(i,j)}$,
(ii) the bias and the efficiency of graphical estimators for a and b,


corresponding to the different approaches reported in Table 1 and to the proposed one.
Slightly differently from [10, 11, 18], the following root mean square error (RMSE) 
and maximum absolute deviation (MAD) indices are utilized to compare (i) 


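$$\mathrm{RMSE} = \sqrt{\sum_{i} \left(\tilde{\mu}_{(i)} - \mu_{(i)}\right)^2 / N}, \qquad \mathrm{MAD} = \max_{i \le j} \left|\tilde{\sigma}_{(i,j)} - \sigma_{(i,j)}\right| \qquad (6)$$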
Note that they are not pivotal (parameter-free) and may therefore vary according to the actual
distribution parameters. Therefore, the latter are set to standard values, without the
RMSE being undetermined when any $\mu_{(i)} = 0$ as in [10]. The exact evaluation of
$\mu_{(i)}$ and $\sigma_{(i,j)}$ in (6) is obtained using numerical integration [13]. Moreover, the
following indices, namely the pivotal root deviation (PRD) and the pivotal absolute
bias (PAB) of the estimators $\hat{a}$ and $\hat{b}$, are introduced


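$$\mathrm{PRD}(\hat{a}) = \sqrt{E\{(\hat{a} - a)^2\}/b^2}, \qquad \mathrm{PRD}(\hat{b}) = \sqrt{E\{(\hat{b} - b)^2\}/b^2} \qquad (7)$$


$$\mathrm{PAB}(\hat{a}) = |E\{\hat{a}\} - a|/b, \qquad \mathrm{PAB}(\hat{b}) = |E\{\hat{b}\} - b|/b \qquad (8)$$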
in order to compare (ii). It is straightforward to show that (7) and (8) are pivotal (see, e.g.,
[6, 12]) and therefore the obtained results hold whatever the parameter values. The lower
the RMSE and the MAD, the better the proposed approximation of $\mu_{(i)}$ and $\sigma_{(i,j)}$,
respectively. Table 2 reports RMSE and MAD achieved by the different approaches
reported in Table 1 and the proposed one, whereas Table 3 reports PRD and PAB
of the estimators $\hat{a}$ and $\hat{b}$. As anticipated, note that the CH and GU approaches do not
involve the approximation of $\sigma_{(i,j)}$, and thus MAD does not apply to them.
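
The pivotal nature of PRD and PAB can be illustrated by simulation. The sketch below is not the authors' code: it assumes the GU approach amounts to an OLS fit of the ordered observations on $G^{-1}(i/(N+1))$ for the EV distribution for minima, with cdf 1 - exp(-exp((x - a)/b)), and all function names are hypothetical. Running it with the same seed under two different parameter pairs should return the same indices up to rounding.

```python
import math
import random


def ev_min_sample(a, b, n, rng):
    """Draw n variates from the EV distribution for minima with location a and scale b."""
    return [a + b * math.log(max(rng.expovariate(1.0), 1e-300)) for _ in range(n)]


def gumbel_ols(sample):
    """OLS fit of the ordered sample on G^{-1}(i/(N+1)), G being the standard EV cdf for minima."""
    x = sorted(sample)
    n = len(x)
    z = [math.log(-math.log(1.0 - i / (n + 1))) for i in range(1, n + 1)]
    zbar, xbar = sum(z) / n, sum(x) / n
    b_hat = (sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x))
             / sum((zi - zbar) ** 2 for zi in z))
    return xbar - b_hat * zbar, b_hat


def prd_pab(a, b, n, reps, seed=0):
    """Monte Carlo estimates of the indices in (7) and (8)."""
    rng = random.Random(seed)
    est = [gumbel_ols(ev_min_sample(a, b, n, rng)) for _ in range(reps)]
    mean_a = sum(e[0] for e in est) / reps
    mean_b = sum(e[1] for e in est) / reps
    return {
        "PRD_a": math.sqrt(sum((e[0] - a) ** 2 for e in est) / reps) / b,
        "PRD_b": math.sqrt(sum((e[1] - b) ** 2 for e in est) / reps) / b,
        "PAB_a": abs(mean_a - a) / b,
        "PAB_b": abs(mean_b - b) / b,
    }


if __name__ == "__main__":
    print(prd_pab(0.0, 1.0, n=5, reps=2000, seed=3))
    print(prd_pab(10.0, 3.0, n=5, reps=2000, seed=3))
```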


4 Conclusions 


Table 2 clearly shows that the proposed approximations for the mean and the covari- 
ances of the order statistics from the EV distributions achieve the best performances 
at each considered sample size (N = 5,30) both in terms of RMSE and MAD. Table 
3 confirms that the corresponding graphical estimators of distribution parameters 
achieve the smallest bias (PAB) and the highest efficiency (i.e., the smallest PRD) 


Table 2 RMSE and MAD achieved by approaches reported in Table 1 and that proposed for EV 
distributions at different sample sizes — bold text highlights the smallest value of each column. 


EV distribution for minima EV distribution for maxima 


RMSE MAD RMSE MAD 
N=5 N=30 N=5 N=30 N=5 N=30 N=5 N=30 


Proposed 0.00355 0.00057 0.01453 0.00193 0.00355 0.00057 0.01453 0.00193 


PF - - - - 0.01564 0.00288 0.02531 0.00324 
HL 0.00756 0.00282 0.24872 0.32144  - - - - 
CH 0.04349 0.00743  - - - - - - 
GU 0.23592 0.12330 - - 0.23592 0.12332 - - 


Table 3 PRD and PAB of â and b̂ achieved by approaches reported in Table 1 and that proposed for
EV distributions at different sample sizes; bold text highlights the smallest value of each column.


                       EV distribution for minima             EV distribution for maxima
N     Approach   PRD(â)   PAB(â)   PRD(b̂)   PAB(b̂)    PRD(â)   PAB(â)   PRD(b̂)   PAB(b̂)

5     Proposed   0.48020  0.00191  0.40820  0.00081    0.48020  0.00191  0.40820  0.00081
5     PF         -        -        -        -          0.48176  0.01852  0.40403  0.01018
5     HL         0.48009  0.00836  0.41496  0.00425    -        -        -        -
5     CH         0.48095  0.04568  0.46173  0.00569    -        -        -        -
5     GU         0.48358  0.00383  0.61444  0.24902    0.48358  0.00383  0.61444  0.24902

30    Proposed   0.19266  0.00079  0.14686  0.00009    0.19266  0.00079  0.14686  0.00009
30    PF         -        -        -        -          0.19273  0.00216  0.14714  0.00173
30    HL         0.19371  0.00165  0.14949  0.00164    -        -        -        -
30    CH         0.19684  0.00462  0.18467  0.00148    -        -        -        -
30    GU         0.19481  0.00640  0.21633  0.08979    0.19481  0.00640  0.21633  0.08979
even when plugging in the proposed approximations for large sample sizes (N = 30).
However, as expected, some rather biased estimators, namely PF and HL, can be slightly
more efficient at small sample sizes (N = 5). In agreement with [10], note that
graphical estimators of distribution parameters that rely on the OLS instead of the
GLS estimation method, namely CH and GU, are always the least efficient. Hence,
Makkonen's claims in the plotting position controversy mentioned in the introduction
cannot be supported. In view of these results, the proposed approximation
for EV distributions allows practitioners to retain classical graphical methods
rather than drastically abandoning them in favour of more efficient analytical solutions.


Monitoring ship performance via multi-way
partial least-squares analysis of functional data 


Monitoraggio delle prestazioni di una nave mediante 
analisi multi-way partial least squares di dati funzionali 


Antonio Lepore, Biagio Palumbo and Christian Capezza 


Abstract The multi-sensor systems installed on board modern ships provide
massive amounts of data that require appropriate multivariate methods for continuous
performance monitoring during voyages. In this paper, functional data are obtained
from variables that describe the operating conditions of a Ro-Pax cruise ship owned by
the Grimaldi Group and are analysed via multi-way partial least-squares regression
of the fuel consumption per hour. The proposed procedure is shown to predict
and monitor ship performance well and to indicate if and when an anomaly occurs in
ship operating conditions throughout each voyage.

Abstract I sistemi di acquisizione dati installati a bordo delle moderne navi
generano un’enorme quantità di dati che rende necessario lo sviluppo di opportuni
metodi multivariati per il monitoraggio delle prestazioni durante la navigazione. 
Nel presente lavoro, viene effettuata un’analisi dei dati funzionali che descrivono le 
condizioni operative di viaggio di una nave da carico e passeggeri di proprietà 
della società armatoriale italiana Grimaldi Group. La procedura proposta in questo 
articolo consente, mediante regressione multi-way partial least squares, la 
previsione e il monitoraggio continuo delle prestazioni della nave ed è in grado di 
supportare l’individuazione delle anomalie e dell’istante in cui esse si presentano 
durante un viaggio. 
fuel consumption monitoring, multivariate control chart


1 Introduction


Nowadays, thanks to real-time multi-sensor systems installed on board, modern ships
continuously measure and store an overload of operating data that requires
appropriate multivariate methods for monitoring ship performance. Monitoring the fuel
consumption of a ship has usually been limited to single measurements of variables
for each voyage, or to the continuous monitoring of a single variable throughout the entire
voyage, generally the speed over ground (SOG). Even though functional data
analysis is applied in several subject areas [1,2], it has never been implemented in
the maritime field. In this paper, functional data are obtained from variables that
describe ship operating conditions and are used to apply multi-way partial least-squares
(MPLS) [3,4] for monitoring ship performance. Based on trajectories of
different ship operating conditions, squared prediction error charts are used to
monitor anomalies in ship operating conditions, whereas a prediction error chart
monitors the fuel consumption per hour (FCPH).


2 The procedure 


Time is the functional domain used in most functional data analysis
applications. However, on a given ship route, travel duration varies significantly
from voyage to voyage. Therefore, a more appropriate domain needs to be chosen to
allow navigation variable measurements to be compared over different replications, i.e.,
different voyages. In this paper, the percentage of the total distance travelled by the ship
in each voyage is chosen as the functional domain. At a given domain point, the
ship is almost in the same position over different replications of a given route and its
operating conditions are reasonably expected to be similar when no anomalies in
ship performance have occurred. Discrete measured values of a navigation variable at
different domain points for each voyage are then converted to functional data.
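
The change of domain described above can be sketched as follows. This is an illustrative simplification rather than the procedure used in the paper: it assumes plain linear interpolation (the paper does not state the smoothing method here), a grid running from 0% to 100%, and a hypothetical function name `to_percent_distance`.

```python
from bisect import bisect_right


def to_percent_distance(cum_distance, values, n_points=100):
    """Evaluate a navigation variable on an equally spaced percent-of-distance grid.

    cum_distance: non-decreasing cumulative distance at each measurement.
    values: measurements of the variable at the same instants.
    """
    total = cum_distance[-1] - cum_distance[0]
    if total <= 0:
        raise ValueError("the voyage must cover a positive distance")
    pct = [100.0 * (d - cum_distance[0]) / total for d in cum_distance]
    grid = [100.0 * k / (n_points - 1) for k in range(n_points)]
    evaluated = []
    for g in grid:
        # Index of the right end of the measurement interval that contains g.
        hi = min(max(bisect_right(pct, g), 1), len(pct) - 1)
        lo = hi - 1
        span = pct[hi] - pct[lo]
        w = 0.0 if span == 0 else (g - pct[lo]) / span
        evaluated.append(values[lo] + w * (values[hi] - values[lo]))
    return grid, evaluated


if __name__ == "__main__":
    # Toy voyage: four measurements of speed taken at irregular distances.
    grid, sog = to_percent_distance([0, 10, 30, 40], [10, 12, 16, 18], n_points=5)
    print(list(zip(grid, sog)))
```

Two voyages of different duration resampled in this way share the same grid, so their values can be compared point by point.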

The main underlying idea of the proposed procedure is to obtain a three-dimensional
array X by evaluating the ship operational reference (functional) data at given domain
points, with the following three dimensions: number of replications I, number of
variables J, and number of evaluation points K. Thus, an MPLS regression model
can be built on the ship FCPH at each voyage as the scalar response
variable, which is organized into the (I×1) vector y. Functional data analysis is
found useful to obtain instantaneous information not only about the SOG, which, as
is known, represents the most significant variable determining FCPH [5], but also
about its derivative, i.e., acceleration (which can be used as an additional predictor).
Furthermore, discrete operating conditions are generally available (e.g.,
departure/arrival operating conditions, route type) for each voyage and can be stored
in an (I×M) matrix Z. Data are mean centered and scaled prior to performing the
analysis.
MPLS is thus applied by unfolding the array X into a large $(I \times JK)$ matrix $\tilde{X}$
and considering the $(I \times (JK+M))$ matrix $X = [\tilde{X}\ Z]$, which can be decomposed
via the partial least-squares (PLS) method into a smaller number R of orthogonal
score vectors $t_1, \ldots, t_R$, arranged in an $(I \times R)$ matrix T. The latter can eventually be
used as regressor for y as follows

$$X = T P^{T} + E; \qquad y = T q + f, \qquad (1)$$

where P is the $((JK+M) \times R)$ matrix of the X-loadings, q is the $(R \times 1)$ vector of
y-loadings, and E $(I \times (JK+M))$ and f $(I \times 1)$ are residual matrices. The matrix
T is given by [6,7]


$$T = X W (P^{T} W)^{-1}, \qquad (2)$$
where W is the $((JK+M) \times R)$ matrix of the X-weights. Forthcoming voyages can
then be monitored by specializing the squared prediction error statistic $SPE_X$ [3] for
residuals in the predictor variable space at each voyage to a single instantaneous
evaluation point k (i.e. a given percentage of distance travelled) as


$$SPE_k = \sum_{c=(k-1)J+1}^{kJ} e_c^2,$$


where the $(1 \times (JK+M))$ vector e contains the
corresponding X-residuals. $SPE_k$ represents the perpendicular distance of the
instantaneous ship operating condition measurements from the reduced predictor
variable space obtained based on the reference data. Control limits for both $SPE_X$
and $SPE_k$ statistics are given by [8]. Detailed information can be obtained about
plausible causes of anomalies by interrogating the MPLS model. While the $SPE_k$
statistic is able to clearly detect problems at a specific point k, one can examine the
contribution of the j-th individual variable to the $SPE_X$ statistic through


$$SPE_{X,j} = \sum_{k=1}^{K} e_{(k-1)J+j}^2.$$


If a forthcoming voyage shows no anomalies, i.e., the monitoring statistics do not
exceed control limits, approximate prediction intervals for the future observation of
FCPH y can be calculated through the limits


$$\hat{y} \pm t_{I-R-1,\alpha/2}\, \hat{\sigma} \sqrt{1 + t_{new}^{T} (T^{T} T)^{-1} t_{new}} \quad [3],$$


where $t_{new}$ is obtained as $t_{new}^{T} = x_{new}^{T} W (P^{T} W)^{-1}$ from the future x-observations
$x_{new}$ and Equation (2), $\hat{y} = t_{new}^{T} q$, $\hat{\sigma}^2 = f^{T} f/(I-R-1)$ and $t_{I-R-1,\alpha/2}$ is the $100\alpha/2$
percentile of a Student’s distribution with $(I-R-1)$ degrees of freedom.
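
To make the bookkeeping of this section concrete, the sketch below unfolds a three-way array and computes squared prediction errors from a residual vector. It is an illustration under stated assumptions, not the authors' implementation: the columns of the unfolded matrix are assumed to be ordered point by point (all J variables at point 1, then at point 2, and so on, followed by the M discrete variables), the residual vector e is taken as given rather than obtained from a fitted PLS model, and the function names are hypothetical.

```python
def unfold(array, Z):
    """Unfold an I x J x K nested list and append Z, giving an I x (J*K + M) matrix."""
    rows = []
    for voyage, z in zip(array, Z):
        J, K = len(voyage), len(voyage[0])
        # Assumed column order: variables 1..J at point 1, then at point 2, ...
        rows.append([voyage[j][k] for k in range(K) for j in range(J)] + list(z))
    return rows


def spe_at_point(e, J, k):
    """Sum of squared X-residuals of the J variables at evaluation point k (1-based)."""
    return sum(r * r for r in e[(k - 1) * J:k * J])


def spe_contribution(e, J, K, j):
    """Sum over all K points of the squared residuals of variable j (1-based)."""
    return sum(e[(k - 1) * J + (j - 1)] ** 2 for k in range(1, K + 1))


if __name__ == "__main__":
    # One toy voyage with J = 2 variables, K = 3 points and M = 1 discrete variable.
    X = [[[1, 2, 3], [4, 5, 6]]]
    Z = [[9]]
    row = unfold(X, Z)[0]
    print(row)
    # The unfolded row is reused as a stand-in residual vector, for illustration only.
    print(spe_at_point(row, J=2, k=2), spe_contribution(row, J=2, K=3, j=1))
```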


3 Application


The proposed procedure has been applied to real operational data acquired on board
a Ro-Pax ship operating in the Mediterranean Sea, owned by the Grimaldi Group.
For each voyage, J = 7 functional variables have been considered as predictors of
the FCPH:
1. SOG [kn];
2. acceleration [kn/s];
3. power difference between port and starboard propeller shafts [kW];
4. power difference between port and starboard shaft generators;
5. longitudinal wind [kn];
6. side wind [kn];
7. distance from the mean route [NM].
The last variable takes into account the path differences between voyages of the
same route type. Each functional variable has been evaluated at K = 100 equally
spaced domain points. The matrix Z contains two indicator variables that
distinguish the three route types that the ship sails. MPLS has then been applied to a
set of I = 192 reference voyages. A single run of a 10-fold cross-validation procedure
based on the PRESS statistic [9] has been carried out and selected R = 4 latent
variables. The coefficient of determination is equal to 0.93 and confirms that the model is
able to adequately predict the FCPH at each voyage. Ship performance has then been
monitored on 51 successive voyages. Figure 1 shows the $SPE_X$ at each voyage and
highlights unusual operating conditions for voyages 22, 23, 26, 27, 30, 39 and 44.
These voyages are further individually examined using the $SPE_k$ control chart. As
an example, the $SPE_k$ statistic for voyage 14 reported in Figure 2a does not show
unusual variations throughout the voyage, whereas it is clearly above the 95%
control limit for voyage 39, as shown in Figure 2b. The anomalous voyages can be
checked against the reference model to determine the reason for their difference.
This can be investigated by using the contribution plots.


Figure 1: Monitoring charts for $SPE_X$ statistics with 95% and 99% control limits (dashed and solid line)
for 51 new voyages (vertical axis: $SPE_X$; horizontal axis: voyage number).
Figure 3 displays the contribution of each variable to the $SPE_X$ statistic for voyage
39 and indicates the variable 6 (distance from the mean route) as the main variable
responsible for the out-of-control observation. For voyages with $SPE_X$ and $SPE_k$
statistics in control, FCPH can be monitored through the prediction error chart
illustrated in Figure 4. In this chart, voyages that fall outside the prediction limits
require further investigation of those variables that have not been considered in the
MPLS model.


Figure 2: Monitoring charts for $SPE_k$ statistics with 95% and 99% control limits (dashed and solid line)
for a new regular voyage (voyage 14) with no anomalies (a) and for a new voyage (voyage 39) where
a problem is clearly identified (b) (vertical axis: $SPE_k$; horizontal axis: percent miles).


Figure 3: Contribution of variables to $SPE_X$ statistic for voyage 39 in Figure 1b
(vertical axis: contribution to $SPE_X$; horizontal axis: variable).


Figure 4: Monitoring FCPH for new voyages for which squared prediction error statistic shows no
anomalies through prediction error chart, with 95% prediction limits (vertical axis: prediction
error [t/h]; horizontal axis: voyage).


4 Conclusion


In the shipping industry, there is a lack of methods that allow multivariate continuous
monitoring during voyages. Functional data give instantaneous information on ship
operating conditions and can be used to build linear models via multi-way partial
least squares to monitor ship performance and predict fuel consumption per hour.
The application illustrated in this paper shows that the proposed procedure is able to
provide adequate predictions and to indicate if and when anomalies occur. The
squared prediction error statistic evaluated at a single domain point gives clear
indications in this regard. This would not have been feasible through statistical
models built using a single variable observation per voyage. Functional data
analysis could be exploited for numerous applications, such as the crucial theme of
developing predictive maintenance techniques on ship engines and identifying types
of faults in order to provide early warnings.


Dynamic profiling of banking customers: a
pseudo-panel study 


Segmentazione dinamica di clienti bancari: uno studio 
basato su indagini ripetute 


Caterina Liberati, Lisa Crosato, Paolo Mariani and Biancamaria Zavanella 


Abstract The analysis of the evolution of satisfaction in a business context is usually
based on pseudo-panel studies, because they are less costly and easier to build with
the available data. As in the cross-section case, detailed information about customers
is collected at each time point, but the dynamic comparison generally involves few
temporal lags (due to the short lifetime of products and services). Accordingly, in our
paper we apply the Dual Multiple Factor Analysis. Such a technique allows the
synthesis of the multivariate tables and their visualization on a common space that
sheds light on customers’ trajectories of satisfaction. A real case study of an Italian
bank is illustrated.

Abstract L’analisi dell'evoluzione della soddisfazione nel contesto di business, è di 
solito basato su studi pseudo-panel, perché sono meno costosi e facili da costruire 
con i dati disponibili. Come in casi cross-section, informazioni dettagliate sui clienti 
vengono raccolte in ogni istante temporale, ma il confronto dinamico avviene gen- 
eralmente considerando soli pochi ritardi temporali (a causa del breve ciclo di vita 
di prodotti e servizi). Di conseguenza, nel nostro contributo proponiamo l’utilizzo 
dell’analisi fattoriale multipla duale. Tale tecnica permette la sintesi delle tabelle 
multivariate e la visualizzazione delle stesse su uno spazio comune che fa luce su 
traiettorie di soddisfazione dei clienti. I vantaggi della tecnica vengono illustrati in 
un caso di studio relativo ad una banca italiana. 

Key words: Customers Profiling, Customer Satisfaction Surveys, Dual Multiple
Factor, Pseudo-Panels 


1 Introduction 


To date, the study of Customer Satisfaction has dominated marketing behavioural
literature.[5] Despite the strong recognition that consumer behaviour should be
viewed from a dynamic perspective, only a residual percentage of the studies published
in marketing has addressed the problem in this manner.[12] [10] [9] [8] The
dearth of panel studies appears to be largely a consequence of costs to the company
and difficulty in obtaining longitudinal data sets and/or maintaining the sample over
time[7], possibly due to little incentive to build databases of historical performance
for products and services.

Given the deficiencies of cross-sectional data and the problems associated with 
collecting longitudinal panel data, one practical solution is to exploit, as much as 
possible, all of the information already available in various cross-sectional data 
sources.[3] 

The econometric literature proposes a way to perform such matching: the collection
of pseudo-panel data, which makes it possible to monitor gross change using
a time series of cross-sectional data. In this regard, [1] introduced the use of
cohorts to estimate a fixed effects model from repeated cross-sections. The benefits
of such a procedure are several, from a decrease in attrition to a drop in individual
measurement errors. Although the econometric approach has provided a valuable
contribution to studying pseudo-panel data, such a strategy seems inadequate for the
treatment of marketing surveys, where individual level changes must be monitored.

In an attempt to approach the Customer Satisfaction study from a dynamic perspective
based on pseudo-panel surveys, this work proposes the use of Multiple
Factor Analysis (MFA). MFA is an extension of Principal Component Analysis
(PCA), tailored to handle multiple data tables that measure sets of variables collected
over the same observations or, alternatively (in the Dual MFA), multiple data
tables where the same variables are measured over different sets of observations.
The advantages of such a technique are several, ranging from full use of the
information (in terms of instances and variables) to synthesising the dimensionality
of the tables and easily visualising points across time. Indeed, once all of the
instances have been embedded into the common factorial plane, a post-hoc stratification
can be performed to reduce the number of instances to a manageable number
of profiles that are mutually exclusive and share well-defined characteristics.[4] [11]




2 Modeling Data Over Time 


The analysis of several sets of individuals described by the same set of variables is a
problem frequently encountered, not only in marketing. Dual Multiple Factor Analysis
(DMFA) addresses exactly this task.[2] The general idea behind DMFA is to
normalize each of the datasets and then to combine these data tables into a common
representation of the variables called the compromise map.[6] Let us denote by X
an N × K matrix composed by column-wise juxtaposition of L sub-matrices $X_l$, each of
them collecting the same set of variables on different observations. The mathematical
formulation of DMFA can be described in two steps. In the first one, a
grand matrix X is obtained by juxtaposing the standardized $X_l$ by column and
weighting them with the first eigenvalue coming from separate PCAs on each $X_l$. In the
second step, a Principal Component Analysis of the grand matrix is performed:


$$(X'DX) = \Gamma \Lambda \Gamma' \qquad (1)$$


where D is the (N × N) diagonal matrix (metric) whose terms are the masses
associated to the observations, $\Gamma$ is a K × K orthonormal basis and $\Lambda$ is a K × K
diagonal matrix of eigenvalues.

The spectral decomposition theorem ensures the best reconstruction in terms of
least squares of the weighted correlation matrix (X'DX); the solution provides
individuals’ factor scores of the total matrix X, which represent a compromise of the L
sub-matrices:

F=XA°U (2) 


The data analysed in this study monitor several aspects related to customer
satisfaction across three years (2010-2012). They collect clients’ evaluations of:


• Banking touch points (Personnel (a), ATM (b), Internet-banking (c))

• Image of the credit institution (Prestige (e), Innovation (f), Honesty (g), Trust (h))

• Proxies of customer engagement (Probability to recommend the bank to someone (d),
Interest for the customers (i), Overall Satisfaction (l))


The total number of collected instances was 6193, comprising 2068 (2010),
2058 (2011) and 2067 (2012) observations over the three years.

Our data matrix X (6193 x 10) was obtained by stacking (by column) the three
data tables, X = [X2010;X2011;X2012]. The two-step procedure was employed on the
matrices under study: first, X2010, X2011 and X2012 were centered and standardised and
a separate PCA was run on each of the tables to obtain the weights to balance the
within-groups inertia. Second, a PCA was performed on the grand matrix obtained.
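
The first step of the procedure can be sketched in a few lines. The code is a minimal illustration, not the software used for the study: it assumes that "weighting with the first eigenvalue" means dividing each standardised table by the square root of the first eigenvalue of its correlation matrix (the usual MFA normalisation), it uses a plain power iteration in place of a full PCA, and all function names are hypothetical.

```python
import math


def standardise(table):
    """Centre each column and scale it to unit (population) standard deviation."""
    n = len(table)
    cols = list(zip(*table))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in table]


def cross_product(table):
    """Return X'X / n, i.e. the correlation matrix when X is standardised."""
    n, k = len(table), len(table[0])
    return [[sum(r[a] * r[b] for r in table) / n for b in range(k)] for a in range(k)]


def first_eigenvalue(matrix, iters=500):
    """Largest eigenvalue of a symmetric positive semi-definite matrix by power iteration."""
    k = len(matrix)
    v = [1.0 / (j + 1) for j in range(k)]  # arbitrary start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return lam


def grand_matrix(tables):
    """Standardise each table, weight it by its first eigenvalue and stack the rows."""
    grand = []
    for table in tables:
        z = standardise(table)
        scale = math.sqrt(first_eigenvalue(cross_product(z)))  # assumed weighting
        grand.extend([[v / scale for v in row] for row in z])
    return grand


if __name__ == "__main__":
    year_1 = [[1, 2, 3], [2, 1, 4], [3, 5, 5], [4, 3, 8], [5, 6, 7]]
    year_2 = [[2, 9, 1], [4, 7, 3], [6, 8, 2], [8, 3, 6]]
    X = grand_matrix([year_1, year_2])
    print(len(X), len(X[0]))
```

After this weighting the first eigenvalue of every block equals one, so that no single year dominates the PCA of the grand matrix carried out in the second step.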

The solution provided by the spectral decomposition uncovers a peculiar configuration
of the variables onto the principal compromise plane (Fig. 1). Accordingly,
we named $f_1$ as standardized-customised banking service and $f_2$ as
imagined-experienced bank.

Instances can also be projected on the compromise plane. Due to the large number
of individuals composing the sample, it is very difficult to visualise the dynamic
paths of each subject; therefore, a segmentation was performed to profile the long-term
behaviours of customers’ profiles. We focus our attention on those groups that
were selected according to the Italian bank’s interests, but the analysis can easily be
replicated for any other group.

Fig. 1 Variables projection onto the principal plane

We compare the trajectories of clients distinguished by socio-demographic
characteristics, since such variables usually play an important role in the assessment
formulation. Visual inspection of the graphical representations in Fig. 2, which
depict average positions and inner variability of the profiles over time, reveals different
evolutions of the monitored groups. In our case, female customers move along the
positive side of the first axis, showing a high appreciation for a customised banking
service (Fig. 2 left panel). On the contrary, males move in exactly the opposite
direction, seeming more involved with standardised assistance. The relationship
with the bank also has different connotations when studied by gender: females prefer to
experience services provided by the financial institution, while males hold the bank’s
image in high regard. Both paths show an involution in the
temporal trends, probably reflecting a lack of specific actions/stimuli per gender.

A second comparison contrasts customers’ behaviours
distinguished by educational level (Fig. 2 right panel). This time, the trajectories
differ in shape and length. Graduates, lying on the negative side of
the first axis over the three years, show a high appreciation for remote touch points,
while middle school graduates have fuzzier evaluations. On the contrary, poorly
educated profiles (such as middle school or elementary school graduates) seem to prefer
customised assistance.

Fig. 2 Dynamic behaviours: trajectories per gender (left panel); trajectories per educational levels
(right panel)


3 Conclusion 


This paper presents a new approach to Customer Satisfaction management. The
idea originated from the necessity of a longitudinal perspective in the assessment of
Customer Satisfaction, also expressed by several contributions in behavioral studies.
Today, this task is even more important because information is available and
inexpensive. The strength of our approach is that it is model-free, so it can be applied
to the comparison and visualisation of any set of data tables. With reference to the real case study
illustrated in the paper, the evidence found is quite interesting and can easily be
interpreted in terms of management implications. The compromise factorial plane
obtained makes it possible to distinguish the profiles of the clients who prefer remote services
from those who are involved with customised assistance. It also uncovers different
types of relationship between customers and the bank: such evidence, if monitored
over time, can help the management to better respond to clients’ demands.


A comparison between seasonality indices
deployed in evaluating unimodal and bimodal 
patterns 


Un confronto tra indici di stagionalità utilizzati nella 
misura di pattern unimodali e bimodali 


Giovanni L. Lo Magno, Mauro Ferrante, Stefano De Cantis 


Abstract This paper discusses a recently proposed index for measuring seasonality,
which is based on the solution of the well-known transportation problem. A
specific characterization of the cost matrix makes it possible to take into account
the cyclical structure of time periods, which characterizes the phenomenon under
observation. Various features of the proposed index are evaluated by comparing
it with other indices which are commonly used in the measurement of seasonality,
such as the Gini concentration index. Given the wide range of disciplines with an
interest in the analysis of seasonal phenomena, the approach proposed may be of
wide interest.

Abstract Il presente lavoro discute un indice recentemente proposto per la misura
della concentrazione stagionale e basato sulla soluzione del problema del trasporto. 
Una specifica caratterizzazione della matrice dei costi consente di tenere in con- 
siderazione la struttura ciclica dei periodi che caratterizza il fenomeno osservato. 
Diverse caratteristiche dell’indice proposto sono valutate confrontandolo con altri 
indici che sono comunemente utilizzati per la misura della stagionalità, come ad 
esempio l’indice di concentrazione di Gini. Considerata la varietà di ambiti inter- 
essati allo studio della stagionalità, l'approccio proposto può essere d’interesse da 
diverse prospettive. 


Key words: Seasonal amplitude, transportation problem, concentration index 


1 Introduction


The notion of seasonality is both very simple and very complex. On the one hand,
it is a feature of many natural and human phenomena. The differing intensity of
solar rays has numerous effects on the environment, such as varying levels
of humidity and temperature, but also on animal behaviour and human habits
[2]. On the other hand, seasonality is an extremely complex
concept, whose measurement and analysis is a challenge. Indeed, having reviewed
the main indices used in different study contexts, we observed a lack of appropriate
measures of seasonality which are capable of taking the cyclical structure
of time periods into account. That is, the majority of indices currently used in
measuring seasonality (e.g. the Gini concentration index, the coefficient of seasonal
variation or the Theil index) do not take into account the natural ordering of time
periods (e.g. months). Consequently, given a pattern in which the total amount of the
phenomenon of interest is concentrated in two consecutive months (e.g. January and
February) and another pattern in which the phenomenon is concentrated in two very
distant months (e.g. January and August), the currently used indices would evaluate
the two patterns as the same. One of the aims of the seasonality index discussed in
this paper is to overcome this issue. Thus, after a discussion of a recently proposed
index for measuring seasonality, we will evaluate some of its properties in relation
to how it behaves in evaluating unimodal and bimodal distributions.


2 A new index for measuring seasonality and its main properties 


The approach we would like to propose for measuring seasonality is based on the 
solution of an appropriately defined transportation problem [1]. The transportation 
problem is a well-known, linear, minimization problem in which the goal is to min- 
imize the cost of transferring units from a set of warehouses to a set of customers, 
satisfying the constraints given by the available resources and the requested de- 
mands. 

We can define a seasonal pattern as the vector $P = (y_1, y_2, \ldots, y_T)$, with observations
for time periods from 1 to T, such that $y_t$, for $t = 1, 2, \ldots, T$, is non-negative.
The total amount of the observed phenomenon is $Y = \sum_{t=1}^{T} y_t$ and the average value
is $Y/T$. Furthermore, let $\mathcal{A} = \{t : y_t > Y/T\}$ be the set of time periods for which
the observed value is over the average; each of these time periods has a surplus
$a_t = y_t - Y/T$. Similarly, let $\mathcal{B} = \{t : y_t < Y/T\}$ be the set of time periods with
observed values under the average, each with deficit $b_t = Y/T - y_t$.

In order to eliminate seasonality, the T-dimensional pattern $(Y/T, Y/T, \ldots, Y/T)$ can
be obtained by transferring units from time periods in $\mathcal{A}$ to time periods in $\mathcal{B}$.
The amount which is transferred from time period $i \in \mathcal{A}$ to time period $j \in \mathcal{B}$ is
$x_{ij}$. We can suppose that transferring one unit from i to j has a cost $c_{ij}$, thus the
cost of all the transfers is $c = \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{B}} c_{ij} x_{ij}$. The minimum cost $c^*$ which is
required to eliminate seasonality, through the aforementioned transfers, is the cost
corresponding to the optimal solution to the following transportation problem:


$$
\begin{aligned}
\min\ & c = \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{B}} c_{ij} x_{ij} \\
\text{s.t.: } & \sum_{j \in \mathcal{B}} x_{ij} = a_i, \quad \forall i \in \mathcal{A} \\
& \sum_{i \in \mathcal{A}} x_{ij} = b_j, \quad \forall j \in \mathcal{B} \\
& x_{ij} \ge 0, \quad \forall i \in \mathcal{A},\ j \in \mathcal{B}
\end{aligned}
\qquad (1)
$$


where the first set of constraints ensures that the amount $a_i$ is transferred from each
time period $i \in \mathcal{A}$; the second set of constraints ensures that each time period $j \in \mathcal{B}$
receives the amount $b_j$; and the third set of constraints ensures that each transfer
is non-negative. The minimum cost $c^*$ corresponding to the optimal solution to the
transportation problem (1) is our absolute measure for seasonality, namely


$$S(P) = c^* \tag{2}$$


In [3] we demonstrated that the maximum value of $S(P)$, holding $Y$ constant, is:


$$S_{\max}(P) = \frac{Y}{T} \max_{i} \sum_{j=1}^{T} c_{ij} \tag{3}$$

This result permits us to construct the relative seasonality index $S_R$, which is bounded
in the interval $[0, 1]$:


$$S_R(P) = \frac{S(P)}{S_{\max}(P)} \tag{4}$$

We recognize that the relation between time periods is cyclical. Time periods
can be thought of as placed on a circle: the shorter arc with $i$ and $j$ as
extremes is what is termed cyclical distance, and this is the unitary cost we consider
for a transfer from $i$ to $j$, namely:


$$
c_{ij} =
\begin{cases}
|i - j| & \text{if } |i - j| \le \frac{T}{2} \\
T - |i - j| & \text{otherwise}
\end{cases}
\tag{5}
$$


Different unitary costs may be employed in defining our seasonality index;
however, when the costs defined in (5) are adopted, the $S_R$ index has three important
properties, which we deem desirable for measuring seasonality:


• Scale invariance: $S_R(P) = S_R(\lambda P)$, with $\lambda > 0$.

• Rotation invariance: $S_R(y_1, y_2, \ldots, y_T) = S_R(y_T, y_1, y_2, \ldots, y_{T-1})$.

• Sensitivity to permutations: generally, and excluding rotations (see the “rotation
invariance” property), the permutations of a pattern are evaluated differently
(the Gini index is permutation-invariant, thus it cannot capture, for example,
an increase in seasonality which results from a reduction in the distance
between two relevant modes).
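
As an illustration of definitions (1)-(5), the following is a minimal Python sketch, not the authors' implementation. It assumes the cyclical costs in (5) and uses a known shortcut for that special case: with these costs, problem (1) is a minimum-cost flow on a cycle, so the optimal cost equals the sum of absolute deviations of the cumulative surpluses from their median. A general cost matrix would require a linear programming solver instead. The function names are hypothetical.

```python
from statistics import median


def cyclic_cost(i, j, T):
    """Cyclical distance (5) between periods i and j on a circle of T periods."""
    d = abs(i - j)
    return d if d <= T / 2 else T - d


def absolute_seasonality(pattern):
    """Optimal cost c* of problem (1), valid only for the cyclical costs (5)."""
    T = len(pattern)
    mean = sum(pattern) / T
    # Cumulative surplus: the net amount crossing the boundary between
    # period t and period t+1 when nothing circulates around the circle.
    cumulative, running = [], 0.0
    for y in pattern:
        running += y - mean
        cumulative.append(running)
    # A constant circulation can be added to every boundary flow;
    # the median is the circulation that minimises the total flow.
    shift = median(cumulative)
    return sum(abs(f - shift) for f in cumulative)


def max_seasonality(pattern):
    """Upper bound (3): (Y/T) times the largest row sum of the cost matrix."""
    T = len(pattern)
    row_sums = [sum(cyclic_cost(i, j, T) for j in range(T)) for i in range(T)]
    return sum(pattern) / T * max(row_sums)


def relative_seasonality(pattern):
    """Relative index (4), bounded in [0, 1]."""
    s_max = max_seasonality(pattern)
    return absolute_seasonality(pattern) / s_max if s_max > 0 else 0.0
```

For instance, `relative_seasonality([1] + [0] * 11)` evaluates a pattern fully concentrated in one period, while a constant pattern has no surplus to transfer.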


610 Giovanni L. Lo Magno, Mauro Ferrante, Stefano De Cantis 


3 Comparing seasonality indices 


In this section we compare the Gini index $G_C$ (normalized in the $[0, 1]$ interval),
the precipitation concentration degree index PCD [5] and the relative seasonality index $S_R$,
when evaluating unimodal and bimodal patterns. Unimodal and bimodal patterns,
obtained by adapting von Mises distributions to the discrete case, were used in these
comparisons.

The von Mises distribution [4] is a circular, continuous, unimodal and symmetric
distribution, with support in the $[0, 2\pi]$ interval. Its shape depends on two parameters:
the first is the expected value $\mu$, which corresponds to the mode; the second is
$\kappa \ge 0$, which affects the degree of concentration. The higher $\kappa$ is, the greater
the concentration around the mode; if $\kappa = 0$, then the von Mises distribution coincides
with the circular uniform distribution.

In order to construct unimodal patterns of size $T$, we discretized von Mises
distributions by partitioning their supports into $T$ equal intervals of size $2\pi/T$ and
then calculating the probability related to each interval. Thus, the value $y_i$ in each
resulting pattern is the probability which is related to the $i$-th interval of the von
Mises support and, consequently, $Y = 1$. In this study 241 von Mises distributions
were considered for the unimodal patterns; these distributions have the same $\mu$, but
$\kappa = 0, 0.1, 0.2, \ldots, 24$. The parameter $\mu$ was set to $\pi/T$, namely the
midpoint of the first interval. Fig. 1 shows the results of all the indices which were applied
to the 241 unimodal distributions under consideration. In order to clarify how the shape
of the distribution changes as $\kappa$ increases, radar charts for the distributions with
$\kappa = 0, 4, 8, 12, 16, 20, 24$ have been included in the lower part of Fig. 1. It can be
observed that all the indices increase as $\kappa$ increases, thus they correctly capture the
concentration parameter $\kappa$. Furthermore, the relation between the three indices is
$S_R < G_C < PCD$, with the only exception of the distribution with $\kappa = 0$, for which
all the indices have the value 0.
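
The discretization described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code used in the study: the interval probabilities are approximated with a midpoint rule on the unnormalized density $\exp(\kappa \cos(\theta - \mu))$ and then normalized, which avoids evaluating the Bessel function in the normalizing constant. The function name and the `steps` parameter are hypothetical.

```python
import math


def discretized_von_mises(kappa, T=12, mu=None, steps=200):
    """Probabilities of T equal arcs of [0, 2*pi] under a von Mises(mu, kappa)."""
    if mu is None:
        mu = math.pi / T  # midpoint of the first interval, as in the text
    width = 2 * math.pi / T
    h = width / steps
    masses = []
    for i in range(T):
        start = i * width
        # Midpoint rule on the unnormalized density exp(kappa * cos(theta - mu)).
        mass = sum(
            math.exp(kappa * math.cos(start + (s + 0.5) * h - mu))
            for s in range(steps)
        )
        masses.append(mass * h)
    total = sum(masses)
    # Dividing by the total mass replaces the analytic normalizing constant.
    return [m / total for m in masses]


# The 241 concentration values kappa = 0, 0.1, ..., 24 considered in the text.
kappas = [k / 10 for k in range(241)]
```

Each resulting pattern sums to 1, so $Y = 1$ as stated in the text.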

In order to observe the behaviour of the indices when they evaluate bimodal
patterns, bimodal patterns of size 12 were constructed as discrete versions of a 50%
mixture of two von Mises distributions, both with the same value of the parameter $\kappa$.
The distance $d$ between the two modes varied between 0 (the unimodal case) and 6 (the
maximum possible distance between two time periods). The values 1 and 5 were considered as
two extreme values for $\kappa$, corresponding respectively to a low and a high concentration
around the two modes. The index values for the patterns when
$\kappa = 1$ are plotted in Fig. 2 and those for the patterns when $\kappa = 5$ are displayed
in Fig. 3.
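
A possible construction of these bimodal patterns is sketched below in Python. It is a sketch under assumptions that the text does not state explicitly: the first mode is placed at the midpoint of the first interval and the second mode $d$ intervals further along the circle. Because both components share the same $\kappa$, their normalizing constants coincide and the mixture can be normalized numerically. The function name is hypothetical.

```python
import math


def bimodal_pattern(kappa, d, T=12, steps=200):
    """Discretized 50% mixture of two von Mises densities whose modes are d periods apart."""
    width = 2 * math.pi / T
    mu1 = math.pi / T        # assumed: midpoint of the first interval
    mu2 = mu1 + d * width    # assumed: midpoint of the interval d periods later
    h = width / steps
    masses = []
    for i in range(T):
        start = i * width
        mass = 0.0
        for s in range(steps):
            theta = start + (s + 0.5) * h
            # Equal weights and equal kappa: the common constant cancels out.
            mass += math.exp(kappa * math.cos(theta - mu1))
            mass += math.exp(kappa * math.cos(theta - mu2))
        masses.append(mass * h)
    total = sum(masses)
    return [m / total for m in masses]
```

With `d = 6` the two halves of the pattern coincide, which is the situation discussed for the PCD index in the comparison that follows.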

Fig. 2 demonstrates that, for bimodal patterns where $\kappa = 1$, the indices decrease
in value as $d$ increases; furthermore, they exhibit similar values. Fig. 3 shows that, for
$\kappa = 5$, the indices also decrease in value as $d$ increases, but at quite different
rates. To understand this difference, note that if $\kappa$ is very high then there is a high
concentration around the two modes, thus patterns where $d > 2$ can be approximately expressed
as $P = (y_1 = 0.5, 0, \ldots, 0, y_{1+d} = 0.5, 0, \ldots, 0)$. Thus, for $d > 2$, all the
patterns can be roughly considered as permutations of one another. As the Gini index is
relatively insensitive to permutations, its value is quite


A comparison between seasonality indices 611 


Fig. 1 Relative seasonality index $S_R$, corrected Gini index $G_C$ and precipitation
concentration degree index PCD, all calculated for discretized von Mises distributions with
various values of $\kappa$ and $T = 12$. Radar charts are reported for the distributions with
$\kappa = 0, 4, 8, 12, 16, 20, 24$. The regular dodecagon in each radar chart corresponds to
the uniform pattern.


stable for patterns where d > 2. This is not a desirable behaviour because a sea- 
sonality index should significantly decrease when the distance between two highly 
concentrated modes increases. In contrast, the PCD index clearly reveals decreasing 
values but it is zero when d = 6. This occurs because PCD equals zero for pat- 
terns where the first half of the pattern is the same as the second, like the pattern 
(1, 3, 5, 1, 3,5). A seasonality index should not return zero for patterns like these. 
Thus, the only index which shows a desirable behaviour is $S_R$, which decreases at
an appreciable rate and does not reach zero when $d = 6$.
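
The zero value for patterns with two identical halves can be checked numerically. The Python sketch below assumes, since the definition is not given in this paper, that PCD is the length of the resultant vector of the pattern divided by its total, with period $t$ placed at angle $2\pi t/T$; see [5] for the actual definition. The function name is hypothetical.

```python
import math


def resultant_length(pattern):
    """Normalized length of the resultant vector (assumed form of PCD)."""
    T = len(pattern)
    total = sum(pattern)
    # Each period contributes a vector of length y_t pointing at angle 2*pi*t/T.
    x = sum(y * math.cos(2 * math.pi * t / T) for t, y in enumerate(pattern))
    v = sum(y * math.sin(2 * math.pi * t / T) for t, y in enumerate(pattern))
    return math.hypot(x, v) / total
```

Under this assumption, opposite periods cancel each other, so `resultant_length([1, 3, 5, 1, 3, 5])` is zero up to rounding error even though the pattern is clearly not uniform.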
Fig. 2 Relative seasonality index $S_R$, corrected Gini index $G_C$ and precipitation
concentration degree index PCD, all calculated for discretized, bimodal von Mises
distributions with $T = 12$, $\kappa = 1$ and values of the distance between the two modes
$d = 0, 1, \ldots, 6$. Radar charts are reported for the distributions with
$\kappa = 0, 4, 8, 12, 16, 20, 24$. The regular dodecagon in each radar chart corresponds
to the uniform pattern.
Fig. 3 Relative seasonality index $S_R$, corrected Gini index $G_C$ and precipitation
concentration degree index PCD, all calculated for discretized, bimodal von Mises
distributions with $T = 12$, $\kappa = 5$ and values of the distance between the two modes
$d = 0, 1, \ldots, 6$. Radar charts are reported for the distributions with
$\kappa = 0, 4, 8, 12, 16, 20, 24$. The regular dodecagon in each radar chart corresponds
to the uniform pattern.


4 Concluding remarks 


The authors of this paper contend that the challenge of measuring seasonality has
not received adequate attention in the literature. For example, the Gini index does
not take into account the natural ordering of time periods. In addition to permitting
a specification of the cost matrix which takes into account the cyclical structure
of time periods, our seasonality index performed well when applied to unimodal
and bimodal patterns. Although a set of desirable properties has been highlighted
in this paper, we hope that new related issues will come to light as a result of
this research. Given these challenges and considering the interdisciplinary nature of
seasonal phenomena, the topic of measuring seasonality merits greater attention
from a wider range of points of view.
Three-way Correspondence Analysis for
Ordinal-Nominal Variables 


L’Analisi delle Corrispondenze a Tre-vie per Variabili 
Ordinali-Nominali 


Rosaria Lombardo and Eric J Beh 


Abstract This paper presents some variants of three-way polynomial correspon- 
dence analysis to analyse associations in three-way contingency tables that are con- 
structed from ordinal and nominal variables. Historically, three-way correspondence 
analysis has been used for this purpose without regard to the ordinal structure of the 
variables. Recently, Lombardo et al. (2017) proposed an alternate orthogonal basis
of Emerson’s polynomials for modelling interactions in three-way contingency ta- 
bles. Here, we propose the hybrid decomposition for modelling cases where not all 
variables are ordered. 

Abstract In questo articolo si propone lo studio di alcune varianti dell’analisi delle 
corrispondenze a tre-vie con polinomi ortogonali, per analizzare le interazioni tra 
variabili ordinali e nominali. In letteratura, l’analisi delle corrispondenze a tre-vie 
é utilizzata per lo studio della dipendenza tra le variabili qualitative senza tener 
conto della natura ordinale delle categorie delle variabili. Recentemente, per tener 
conto di tale caratteristica, Lombardo et al. (2017) hanno proposto di considerare i 
polinomi ortogonali di Emerson come una base ortonormale alternativa per analiz- 
zare le interazioni nelle tabelle di contingenza a tre-vie. In presenza di variabili 
categoriche miste (nominali-ordinali), modelli ibridi di decomposizione saranno 
considerati. 


Key words: Three-way Correspondence Analysis, Ordinal and Nominal Categorical
Variables, Tucker3 Decomposition, Trivariate Moment Decomposition, Hybrid
Decomposition.
1 Introduction


In this paper we present some variants of ordered three-way correspondence
analysis, recently proposed in the literature (Lombardo et al. 2017), for analysing
three-way tables that combine nominal and ordered variables. The objective of ordered
three-way correspondence analysis is to gain insight into the symmetric association among
the three ordered variables that make up a contingency table (Lombardo et al. 2017). This
insight is in addition to what the classical approaches reveal when using methods of
generalized singular value decomposition such as the Tucker3 or the PARAFAC decompositions
(Kroonenberg 2008). Our proposal involves the trivariate moment decomposition of
the data (Lombardo et al. 2017) and the visualization of category trends using polynomial
biplots, both of which are explained below. Indeed, from a geometrical point of view,
modelling the association in a new dimensional space based on Emerson’s polynomial
components for the ordered variables can be seen as the major purpose of ordered three-way
correspondence analysis. Here we consider the case where not all three variables have
ordered categories. As a consequence, we merge the features of the Tucker3 decomposition
with those of the trivariate moment decomposition, combining Emerson’s polynomials and
Tucker3 components via hybrid decompositions. Modelling this association using Emerson’s
polynomials and Tucker3 components deserves special attention, as does the construction
of the graphical representations.

The paper is organised as follows. In Section 2 we briefly present classic three-way
correspondence analysis. Section 3 describes its ordinal variant based on orthogonal
polynomials and Section 4 presents some possible hybrid decompositions.
In Section 5, we briefly illustrate the proposed analysis using data from a survey
of the Dutch Central Bureau of Statistics (Israëls 1987).


2 Three-way correspondence analysis for unordered variables 


Three-way correspondence analysis can be viewed as a generalization of two-way
correspondence analysis for analysing the three-way chi-squared statistic $X^2_{IJK}$ or its
analog $X^2_{IJK}/n$. This index can be partitioned into four orthogonal terms (see, for
example, Lancaster 1951) where the first three terms are the pairwise chi-squared
statistics obtained by aggregating across the categories of each variable and the last
term represents the trivariate interaction among the variables. That is, the sum of
these four terms gives Pearson’s mean squared three-way contingency coefficient:


$$X^2_{IJK}/n = X^2_{IJ}/n + X^2_{IK}/n + X^2_{JK}/n + X^2_{int}/n. \tag{1}$$


To decompose $X^2_{IJK}/n$, we can consider a three-way generalisation of the singular
value decomposition, i.e. the Tucker3 model decomposition (Tucker 1966; Kroonenberg
2008, p. 54; Beh and Lombardo 2014, Section 11.6), which computes the
components for each of the spaces of the three categorical variables and, in addition,
a three-way core array containing the elements that reflect the strength of the links
among these components. For a detailed discussion of three-way correspondence
analysis for nominal variables, see Carlier and Kroonenberg (1996), Kroonenberg
(2008, Chap. 17) and Beh and Lombardo (2014, Chapter 11). Let $P = (p_{ijk})$ be a general
three-way table of joint relative frequencies from the cross-classification of $n$
units according to three variables, called row, column and tube variables, respectively.
Define $D_I$, $D_J$ and $D_K$ as $I \times I$, $J \times J$ and $K \times K$ diagonal matrices
whose general elements are the row, column and tube marginal proportions,
$p_{i\bullet\bullet}$, $p_{\bullet j\bullet}$ and $p_{\bullet\bullet k}$, respectively.
Furthermore, let
$\Pi = (\pi_{ijk}) = \left( \frac{p_{ijk}}{p_{i\bullet\bullet}\, p_{\bullet j\bullet}\, p_{\bullet\bullet k}} - 1 \right)$
be the array of the deviations from the three-way independence model.


Using the Tucker3 model, the general form of the $X^2_{IJK}/n$ decomposition can be
written as


$$\Pi = A G (B \otimes C)' + e = \hat{\Pi} + e, \tag{2}$$


where $\Pi$ is the flattened table of the deviations from the three-way independence
model of dimension $I \times JK$; $G$ is the flattened core array of dimension $P \times QR$,
and $A$, $B$ and $C$ are the component matrices associated with the row, column and
tube variables, of dimension $I \times P$, $J \times Q$ and $K \times R$ (with $P \le I$,
$Q \le J$ and $R \le K$), respectively, and $e$ represents the error of approximation between
the observed $\pi_{ijk}$ and their predicted values $\hat{\pi}_{ijk}$. When $P = I$, $Q = J$,
$R = K$, Equation (2) yields an exact decomposition. The component matrices are orthonormal
with respect to $D_I$, $D_J$ and $D_K$. The general elements of the core array can be
interpreted as the generalized, or three-way analogue, of the two-way singular values. For
the sake of brevity, we do not provide a comprehensive description of the bivariate
interaction terms of $X^2_{IJK}/n$ which can also be modelled.
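
The deviations $\pi_{ijk}$ and the partition in Equation (1) can be computed directly from a table of counts. The following Python sketch is only illustrative: the function name, the dictionary keys and the nested-list layout `counts[i][j][k]` are assumptions of this example, and the interaction term is obtained from the residual deviations after removing the three pairwise deviations.

```python
from itertools import product


def three_way_partition(counts):
    """Return the total inertia X^2/n and its four additive terms."""
    I, J, K = len(counts), len(counts[0]), len(counts[0][0])
    cells = list(product(range(I), range(J), range(K)))
    n = sum(counts[i][j][k] for i, j, k in cells)
    p = {(i, j, k): counts[i][j][k] / n for i, j, k in cells}
    # One-way and two-way marginal proportions.
    pi = [sum(p[i, j, k] for j in range(J) for k in range(K)) for i in range(I)]
    pj = [sum(p[i, j, k] for i in range(I) for k in range(K)) for j in range(J)]
    pk = [sum(p[i, j, k] for i in range(I) for j in range(J)) for k in range(K)]
    pij = {(i, j): sum(p[i, j, k] for k in range(K)) for i in range(I) for j in range(J)}
    pik = {(i, k): sum(p[i, j, k] for j in range(J)) for i in range(I) for k in range(K)}
    pjk = {(j, k): sum(p[i, j, k] for i in range(I)) for j in range(J) for k in range(K)}
    terms = {"total": 0.0, "IJ": 0.0, "IK": 0.0, "JK": 0.0, "interaction": 0.0}
    for i, j, k in cells:
        w = pi[i] * pj[j] * pk[k]        # weight under three-way independence
        dev = p[i, j, k] / w - 1         # pi_ijk, deviation from independence
        d_ij = pij[i, j] / (pi[i] * pj[j]) - 1
        d_ik = pik[i, k] / (pi[i] * pk[k]) - 1
        d_jk = pjk[j, k] / (pj[j] * pk[k]) - 1
        terms["total"] += w * dev ** 2
        terms["IJ"] += w * d_ij ** 2
        terms["IK"] += w * d_ik ** 2
        terms["JK"] += w * d_jk ** 2
        terms["interaction"] += w * (dev - d_ij - d_ik - d_jk) ** 2
    return terms
```

Multiplying each returned term by $n$ gives the corresponding chi-squared statistic.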


3 Three-way correspondence analysis for ordered variables 


When the categories of the variables are ordered, Lombardo et al. (2017) recently
proposed incorporating this feature into the decomposition model by
replacing the Tucker3 components with the orthogonal polynomials of Emerson
(1968), thereby defining the trivariate moment decomposition. To form the polynomial
basis we can generate as many orthogonal polynomials as there are ordered
categories. In general, the first polynomial for each ordered variable of the
three-way table represents the zeroth-order polynomial and consists of values that
are all constant. The second polynomial is the first-order polynomial that describes
the variation in the location of the categories. The third polynomial is the second-order
orthogonal polynomial that reflects the variation in the dispersion of the categories.
Higher-order polynomials represent higher-order moments of the ordered
categories. Emerson considered a computationally efficient way of calculating these
orthogonal polynomials by using a three-term recurrence relation; see for example
Beh and Lombardo (2014, p. 94) and Lombardo et al. (2016).
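
To make the construction concrete, here is a minimal Python sketch that generates polynomials orthonormal with respect to a vector of marginal proportions. It is not Emerson's recurrence as published: each new polynomial is obtained by multiplying the previous one by the category scores and orthogonalizing it against all earlier polynomials. In exact arithmetic only the two most recent projections are non-zero, which is what a three-term recurrence exploits. The function name and the default natural scores $1, 2, \ldots$ are assumptions of this example.

```python
def orthonormal_polynomials(weights, scores=None):
    """Polynomials a_0, ..., a_{n-1} with sum_j w_j a_u(j) a_v(j) = 1 if u == v else 0.

    `weights` are positive marginal proportions summing to 1.
    """
    n = len(weights)
    if scores is None:
        scores = list(range(1, n + 1))  # natural scores (assumed)

    def inner(a, b):
        # Inner product weighted by the marginal proportions.
        return sum(w * x * y for w, x, y in zip(weights, a, b))

    polys = [[1.0] * n]  # zeroth-order polynomial: constant
    for _ in range(1, n):
        candidate = [s * a for s, a in zip(scores, polys[-1])]
        # Remove the components along every polynomial built so far.
        for q in polys:
            coef = inner(candidate, q)
            candidate = [x - coef * y for x, y in zip(candidate, q)]
        norm = inner(candidate, candidate) ** 0.5
        polys.append([x / norm for x in candidate])
    return polys
```

Here `polys[1]` is the linear polynomial, a centred and rescaled version of the scores, and `polys[2]` is the quadratic one.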
The matrices of the row, column and tube orthogonal polynomials are denoted
by $\mathcal{A} = \{\alpha_{iu}\}$ (for $i = 1, \ldots, I$ and $u = 0, \ldots, I-1$),
$\mathcal{B} = \{\beta_{jv}\}$ (for $j = 1, \ldots, J$ and $v = 0, \ldots, J-1$) and
$\mathcal{C} = \{\gamma_{kw}\}$ (for $k = 1, \ldots, K$ and $w = 0, \ldots, K-1$), respectively.
Like the Tucker3 components, the row polynomials are orthogonal with respect to
the marginal proportions $p_{i\bullet\bullet}$. The column and tube polynomials are orthogonal
with respect to the marginal proportions $p_{\bullet j\bullet}$ and $p_{\bullet\bullet k}$,
respectively. As mentioned above, the zeroth-order orthogonal polynomial is a constant, the
first polynomial is linear (respecting the ordinality of categories), the second polynomial is
quadratic (describing the variation of the category dispersion), and so on. The trivariate
moment decomposition also calculates the generalized correlations that replace the links
between components in the core array; each of them is referred to as the trivariate
generalized correlation between the $u$-th-order polynomial component of the first variable,
the $v$-th-order polynomial component of the second variable and the $w$-th-order
polynomial component of the third variable. Like the core array, the generalized
correlation table of dimension $U \times V \times W$ is not super-diagonal. Its flattened form
is given by $Z = \mathcal{A}' D_I \Pi (D_J \mathcal{B} \otimes D_K \mathcal{C})$. By replacing
$A$, $B$ and $C$ in Equation (2) with their orthogonal polynomial equivalents, $\mathcal{A}$,
$\mathcal{B}$ and $\mathcal{C}$, Lombardo et al. (2017) demonstrated that the decomposition
of the three-way Pearson’s coefficient, $X^2_{IJK}/n$, can be expressed in a similar manner
to Equation (2), such that


$$\Pi = \mathcal{A} Z (\mathcal{B} \otimes \mathcal{C})' + e. \tag{3}$$


The dimension of the $Z$ table is $U \times VW$ (with $U \le I$, $V \le J$ and $W \le K$) where
$U$, $V$ and $W$ represent the number of columns in the polynomial component matrices
$\mathcal{A}$, $\mathcal{B}$ and $\mathcal{C}$ for the first, second and third way,
respectively, and $e$ is an error term that is equal to zero when $U = I$, $V = J$ and $W = K$.
As for the Tucker3 model, the bivariate and trivariate interaction terms of the global
association can also be modelled; see Equation (1).
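
The generalized correlations can be computed element by element as $z_{uvw} = \sum_{i,j,k} p_{i\bullet\bullet}\alpha_{iu}\, p_{\bullet j\bullet}\beta_{jv}\, p_{\bullet\bullet k}\gamma_{kw}\, \pi_{ijk}$. The Python sketch below is an illustration only: it builds the polynomials by score-times-previous orthogonalization rather than by Emerson's published recurrence, assumes natural scores, and uses hypothetical function names. With the full set of polynomials ($U = I$, $V = J$, $W = K$) the squared generalized correlations add up to the total inertia.

```python
from itertools import product


def orthonormal_polynomials(weights):
    """Polynomials orthonormal w.r.t. `weights`, built on natural scores 1..n."""
    n = len(weights)
    inner = lambda a, b: sum(w * x * y for w, x, y in zip(weights, a, b))
    polys = [[1.0] * n]
    for _ in range(1, n):
        cand = [(s + 1) * a for s, a in enumerate(polys[-1])]
        for q in polys:
            c = inner(cand, q)
            cand = [x - c * y for x, y in zip(cand, q)]
        norm = inner(cand, cand) ** 0.5
        polys.append([x / norm for x in cand])
    return polys


def generalized_correlations(counts):
    """Return (z, inertia): z[u, v, w] and the total inertia X^2/n."""
    I, J, K = len(counts), len(counts[0]), len(counts[0][0])
    cells = list(product(range(I), range(J), range(K)))
    n = sum(counts[i][j][k] for i, j, k in cells)
    pi = [sum(counts[i][j][k] for j in range(J) for k in range(K)) / n for i in range(I)]
    pj = [sum(counts[i][j][k] for i in range(I) for k in range(K)) / n for j in range(J)]
    pk = [sum(counts[i][j][k] for i in range(I) for j in range(J)) / n for k in range(K)]
    # Deviations from three-way independence.
    dev = {(i, j, k): counts[i][j][k] / n / (pi[i] * pj[j] * pk[k]) - 1 for i, j, k in cells}
    alpha, beta, gamma = (orthonormal_polynomials(m) for m in (pi, pj, pk))
    z = {}
    for u, v, w in cells:  # same index ranges because U = I, V = J, W = K
        z[u, v, w] = sum(
            pi[i] * alpha[u][i] * pj[j] * beta[v][j] * pk[k] * gamma[w][k] * dev[i, j, k]
            for i, j, k in cells
        )
    inertia = sum(pi[i] * pj[j] * pk[k] * dev[i, j, k] ** 2 for i, j, k in cells)
    return z, inertia
```

The element `z[1, 1, 1]`, for example, measures the association between the linear components of the three variables.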


4 Three-way correspondence analysis for mixed variables 


When the three-way contingency table consists of nominal and ordinal variables, the
approximation of the global dependence, or total inertia, in $\Pi$ involves computing
the Tucker3 components for the nominal variables and Emerson’s polynomials for the
ordered variables. Two cases involving such a structure can arise: 1) one nominal
and two ordered variables; 2) two nominal variables and only one ordered. For three
ordered variables refer to Lombardo et al. (2017). After computing the polynomials
for an ordered variable, say the column variable, and the Tucker3 components for
the row and tube variables, the hybrid decomposition takes on the form


$$\Pi = A Z (\mathcal{B} \otimes C)' + e \tag{4}$$
where $e$ is the error term and the components $A$, $C$ and $\mathcal{B}$ for the row, tube
and column variables are computed using an iterative hybrid algorithm. At the first stage of
the algorithm, the components $A$ and $C$ are derived using the singular vectors of the
flattened tables $\Pi_{I \times JK}$ and $\Pi_{K \times IJ}$, respectively. The components of
the third, ordered variable are computed by Emerson’s polynomials related to
$\Pi_{J \times IK}$. A full rank decomposition of the association in a three-way contingency
table using the hybrid models may be achieved by choosing
$(P = I, Q = J, R = K, U = I, V = J, W = K)$, provided that $PQ \ge R$, $UV \ge W$,
$PR \ge Q$, $UW \ge V$, and $QR \ge P$, $VW \ge U$ (Kroonenberg, 2008, p. 66); in this case
the convergence of the algorithm is quickly reached. While the number of polynomials should
always be equal to the number of categories in a variable (Lombardo et al. 2017), the number
of Tucker3 components can be smaller. In this situation, the convergence of the hybrid
algorithm will be reached only after a small number of iterative steps. A full decomposition
is always used when all three variables are ordered, as it is for model
(3), but is seldom used in practice when the variables are not all ordered. In all
cases, given the orthogonality of the components $A$, $B$ and $C$, and of the polynomials
$\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$ with respect to the marginal matrices $D_I$, $D_J$
and $D_K$, respectively, the inertia or Pearson’s mean squared contingency coefficient can
also be written in terms of generalized correlations, such that
$X^2_{IJK}/n = \|\Pi\|^2 = \|G\|^2 = \|Z\|^2$.


4.1 Polynomial biplots 


To graphically depict the association structure in a three-way contingency table
where at least one categorical variable is ordinal, we shall focus here on the
single-variable polynomial biplot. This is only one of a variety of different types of biplots
discussed in the literature; see for example, Kroonenberg (2008), Carlier and Kroonenberg
(1996), Gower et al. (2016) and Lombardo et al. (2017). For the sake of
brevity and consistency with the example illustrated in the next Section, here we
define the coordinates of the single-variable polynomial biplot when only the column
variable is ordinal. The principal polynomial coordinates, or reference-mode
coordinates, are defined as


$$F_{(I \times VR)} = A Z_{(P \times VR)} = \left\{ f_{ivr} = \sum_{p=1}^{P} a_{ip} z_{pvr} \right\}$$


and are the row Tucker3 components weighted by the generalized correlations. The
column-tube interactive coordinates are standard polynomial coordinates and are
defined as


$$H_{(JK \times VR)} = (\mathcal{B} \otimes C) = \left\{ h_{jkvr} = \beta_{jv} c_{kr} \right\}. \tag{5}$$
For this biplot, the coordinates for both the row and column-tube categories are
displayed in the space defined by the column × tube interactive components. We
get as many interactive polynomial axes as the product of the number $V$ of column
polynomials and the number $R$ of tube Tucker3 components.


5 Example: Shoplifting data 


To illustrate the applicability of three-way, ordered and nominal, correspondence
analysis, we consider the contingency table from a survey undertaken by the
Dutch Central Bureau of Statistics (Israëls 1987). The data concern the number
of men and women suspected of shoplifting in 1977 and 1978, both in Dutch general
stores and in big textile stores. The row categories consist of 13 items stolen:
clothing, clothing accessory, tobacco and/or provisions, stationery, books, records,
household goods, candy, toys, jewelry, perfume, hobby and/or tools and other items.
The column categories consist of 9 age groups (in years) of the perpetrators: less
than 12, 12 to 14, 15 to 17, 18 to 20, 21 to 29, 30 to 39, 40 to 49, 50 to 64, 65
and over, and the tube categories are male and female. The three-way chi-squared statistic,
$X^2 = 22317.9$, indicates that there is a strong, statistically significant association
among the three variables. The chosen number of dimensions of the hybrid model is
$P = 4$, $V = 9$, $R = 2$, for which the algorithm converges after 10 iterations and
reconstructs about 92% of the total inertia.
The main purpose of our study of these data is to describe the shoplifting
of items as a function of age and gender. We want to investigate how the age and
gender of the perpetrator influence the types of items that are stolen. What sources
of variation of the age groups for males or females can help to describe this
association? For the sake of brevity, here we only look at the linear polynomial of age,
which allows us to see whether there is a linear growth or decline over time for males
and females.

Fig. 1 Shoplifting data: three-way CA (axes: Principal Axis 1, Principal Axis 2)

Fig. 2 Shoplifting data: Polynomial biplot, ordered three-way CA-linear trend (axis label: Polynomial Axis 2)

To highlight the difference between the hybrid analysis and the
classic analysis, we first display in Figure 1 the results of the classic three-way
correspondence analysis. Figure 1 shows the nested-mode biplot classically proposed
for three-way correspondence analysis (see Kroonenberg 2008, p. 443), where the
age and sex categories are combined to reflect the interactive nature of the variables
along each axis. Here the origin of the plot indicates no association. It shows that
males in the young age groups have a propensity to steal toys, candy and stationery,
and that males in the older age groups, together with the young females (13F), principally
stole household goods and tobacco. Further, it also shows that the young female
age group, less than 12, mainly stole clothing. However, it does not illustrate a clear
trend of the age groups or a prevalence of the male over the female perpetrators,
unlike the polynomial biplot of Figure 2. To portray the association among
the three categorical variables where only the column variable is ordered we use the
polynomial biplot described in Section 4.1. Since only the columns
are ordered, we obtain nine ordered column polynomials (0th, 1st, 2nd, etc.) and
two Tucker3 components for the gender variable. In total we compute 18 interactive
polynomial axes: the first is the combination of the constant column polynomial
and the first Tucker3 component of the gender variable, the second interactive axis
is defined by the combination of the linear column polynomial and the first Tucker3
component of the gender variable, and so on. The second axis reflects changes in the
linearity of columns (ages) given the tube categories (gender). In Figure 2, we consider
the 2nd and 11th axes, since the eleventh axis represents the combination of the linear
column polynomial and the second Tucker3 component of the tube variable; it
again represents the linearity of ages given the gender. The origin of the axis
represents the mean age of the perpetrators. Figure 2 shows a clear linear trend of age
groups for the males and females. The male linear trend is associated with the first
axis while the female linear trend is associated with the second axis. It is also evident
that males stole a greater variety of items than females (indeed, 22597 items were
stolen by males against 10504 by females). On the right-hand
side of the first axis, we observe that males in the young age groups had a propensity
to steal toys, candy and stationery. The left-hand side of the first axis shows that
males in the older age groups principally stole records, books, household goods, perfume,
other items and accessories. Further, it also shows that females in the young age groups
stole mainly clothing and jewelry, while females in the older age groups mainly stole
hobby and tobacco items.
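
The numbering of the 18 interactive axes can be made explicit with a small Python sketch. It assumes that the ordering stated above for the 1st, 2nd and 11th axes continues in the same way, with the polynomial order varying fastest within each Tucker3 component; the function name is hypothetical.

```python
def interactive_axis(order, component, n_polynomials=9):
    """1-based axis number for a column polynomial and a tube component.

    order:      polynomial order (0 = constant, 1 = linear, ...)
    component:  Tucker3 component of the tube variable (1 or 2 here)
    """
    return (component - 1) * n_polynomials + order + 1


# Axes used in the polynomial biplot: linear polynomial with each gender component.
axes_in_biplot = [interactive_axis(1, 1), interactive_axis(1, 2)]
```

Under this assumption `axes_in_biplot` contains the 2nd and the 11th axis, matching the pair considered in Figure 2.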


6 Conclusion 


In this paper we have discussed in a unified framework three-way correspondence 
analysis consisting of nominal and ordered categorical variables. We have done this 
by considering the use of the Tucker3 orthogonal base with the orthogonal poly- 
nomial base (Emerson 1968) defining a three-way hybrid decomposition. We have 
proposed that one way to visually summarize the association is to consider the poly- 
nomial biplot; see Figure 2. By observing the proximity and position of the points 
in this display, we can obtain a more detailed summary of the associations among 
the variables in terms of category trends. 
Log-mean linear models for causal inference
Modelli log-mean linear per inferenza causale 


Monia Lupparelli and Alessandra Mattei 


Abstract We discuss a log-mean linear regression approach to causal inference
when the interest is in assessing the effect of a treatment on a set of multiple
(binary) outcomes which might not be independent. We explore how the effect of
treatment on joint outcomes can be decomposed considering the effect on single
outcomes and the effect on their joint distribution. The method is illustrated through
a randomized experiment concerning the effect of honey on nocturnal cough associated
with childhood upper respiratory tract infections.

Abstract Un approccio basato su regressioni log-mean linear viene discusso per 
fare inferenza causale quando si è interessati a valutare l’effetto di un trattamento su 
variabili risposta (binarie) multiple che potrebbero non essere indipendenti. Si es- 
plora come l’effetto del trattamento su variabili congiunte possa essere decomposto 
considerando l’effetto sulle singole variabili e sulla loro distribuzione congiunta. Il 
metodo viene illustrato attraverso uno studio randomizzato relativo all’effetto del 
miele sulla tosse notturna durante l’infanzia in presenza di infezioni del tratto res- 
piratorio. 


Key words: Causal Relative Risks; Intrinsic and Extrinsic causal effects; Product 
potential outcomes 


1 Introduction 


We focus on assessing causal effects on multiple binary outcomes using the potential
outcome approach to causal inference (see [1] for a comprehensive review). In
particular, we define an enlarged set of potential outcomes of interest and suitable
relative risk causal estimands for these outcomes. We also show that causal effects
on joint outcomes can be decomposed into two components accounting respectively
for the effects on single outcomes and the effect on their joint dependence structure.
For inference, we propose the log-mean linear regression approach for multiple binary
outcomes recently developed by [2].

We apply our framework to the analysis of a real double-blind randomized
study, which we name the Honey study, aimed at evaluating the effects of a single
nocturnal dose of buckwheat honey versus no treatment on nocturnal cough and
sleep difficulty associated with childhood upper respiratory tract infections. From
September 2005 through March 2006, 72 children aged between 2 and 18 years
with cough attributes characterized by the presence of rhinorrhea and cough for 7
or fewer days were recruited on presentation for an acute care visit from a
single university-affiliated pediatric practice in Hershey, Pennsylvania (USA).
Subjective parental assessments of their child’s cough symptoms were collected both
before and after the treatment administration (see [4] for further details about the
study). Here we explore the effect of honey on three attributes of nocturnal cough:
Cough Bothersome, Cough Frequency and Cough Severity.


2 The model 


2.1 Preliminaries and background 


Given the finite set $V = \{1, \ldots, p\}$, let $Y_V = (Y_v)_{v \in V}$ be the vector of
binary outcomes of interest. For every $D \subseteq V$, let $Y_D = 1_D$ be the event of joint
success given that $Y_v = 1$, for each $v \in D$, where $1_D$ is a vector of 1s of size $|D|$.
Then, for every non-empty subset $D$ of $V$, we define the product outcome


$$Y^D = \prod_{v \in D} Y_v \tag{1}$$


which is a binary variable taking value 1 in case of joint success and 0 otherwise.
The interest in our analysis is in exploring the effect of a treatment both on joint
successes, that is, the effect on each product outcome, and on single outcomes.
Therefore, let $Y^V = (Y^D)_{D \subseteq V, D \ne \emptyset}$ be the augmented vector of
outcomes of interest.

In order to formally define the causal effects of interest, we first introduce the
potential outcome approach to causal inference. We take a super-population perspective,
considering the observed group of units as a random sample from an infinite
super-population, and we focus on causal effects of a binary treatment, so that each
unit in the sample can potentially be assigned to an active treatment group ($w = 1$)
or to a control group ($w = 0$). Under the Stable Unit Treatment Value Assumption
(SUTVA, [6]), we can define for each outcome variable, $Y_v$, $v \in V$, two potential
outcomes for each unit. Let $Y_v(0)$ denote the value of $Y_v$ under treatment $w = 0$, and
let $Y_v(1)$ denote the value of $Y_v$ under treatment $w = 1$. Let
$Y_V(w) = (Y_v(w))_{v \in V}$ be the random vector including potential outcomes for every
variable under treatment level $w$, $w = 0, 1$. Then, for every non-empty subset $D$ of $V$,
let


$$Y^D(w) = \prod_{v \in D} Y_v(w), \quad w = 0, 1 \tag{2}$$


be the potential product outcome and let
$Y^V(w) = (Y^D(w))_{D \subseteq V, D \ne \emptyset}$ be the vector of
all product potential outcomes, for $w = 0, 1$. Let $P(Y^D(w) = 1)$ denote the probability
of the joint success $Y_D = 1_D$ under treatment $w = 0, 1$, for any non-empty subset
$D$ of $V$.


2.2 Causal estimands 


In the potential outcome approach causal effects are defined as comparisons of potential 
outcomes, $Y_v(1)$ versus $Y_v(0)$, for a common set of units, for every $v \in V$. 
We focus on causal relative risks, that is $RR_v = P(Y_v(1) = 1) / P(Y_v(0) = 1)$, for each 
single outcome. Then, we define the causal relative risk for any product outcome $Y^D$: 

$$RR_D = \frac{P(Y^D(1) = 1)}{P(Y^D(0) = 1)}, \quad D \subseteq V. \qquad (3)$$

We adopt the convention $RR_\emptyset = 1$. 

For any product outcome $Y^D \in Y^V$, we expect that the causal effect in (3) combines 
the effect of the treatment on the “nested” product outcomes $Y^{D'}$, for any non-empty 
subset $D'$ strictly included in $D$, with the effect of the treatment on the joint 
distribution of $Y_D$, with $D \subseteq V$. In particular, for any $Y^D$, we define the intrinsic 
causal effect (ICE) and the extrinsic causal effect (ECE): 

$$ICE_D = g\big((RR_{D'})_{D' \subset D}\big), \quad D \subseteq V \qquad (4)$$
$$ECE_D = h\big[P(Y_D(0), Y_D(1))\big], \quad D \subseteq V, \qquad (5)$$

for two suitable functions $g(\cdot)$ and $h(\cdot)$. Basically, for any $Y^D \in Y^V$, the ICE accounts 
for the effect of treatment deriving from the product structure of $Y^D$, whereas 
the ECE accounts for the effect of treatment on the joint (dependence) structure of 
$Y_D$. We expect that for any product outcome $Y^D \in Y^V$, the causal effect of treatment 
defined in (3) is a suitable combination of the intrinsic and the extrinsic components 
in (4) and in (5), respectively. 


2.3 Observed data and assignment mechanism 


Unfortunately, we cannot directly observe both $Y_V(0)$ and $Y_V(1)$ for any subject. For 
each unit let $W$ denote the actual treatment assignment: $W = 0$ for units assigned to 
the control group, and $W = 1$ for units assigned to the treatment group. We observe 
$Y_V^{obs} = Y_V(W) = W \cdot Y_V(1) + (1 - W) \cdot Y_V(0)$, but the other potential outcomes, 
$Y_V^{mis} = Y_V(1 - W) = (1 - W) \cdot Y_V(1) + W \cdot Y_V(0)$, are missing. Therefore, in order to learn 
about the causal effects of interest it is crucial to posit an assignment mechanism. In 
what follows, we will maintain the following assumption: 


Assumption 1 (Random treatment assignment): $P(W \mid Y_V(0), Y_V(1), X) = P(W)$, where 
$X$ is a vector of observed covariates. 


Under the randomization assumption, which holds by design in randomized ex- 
periments, we propose a model-based approach to causal inference. Specifically, we 
propose a log-mean linear model for potential outcomes, and we derive maximum 
likelihood estimators of the causal parameters of interest. This approach appears to 
be particularly appealing because the causal effects of interest are directly related to 
model parameters. 
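
To illustrate why randomization matters here, the sketch below compares joint-success frequencies between arms: under Assumption 1, the probability $P(Y^D(w) = 1)$ can be read off the units actually assigned to arm $w$. This is only a simple empirical plug-in, not the maximum likelihood approach proposed in the text, and both the function name and the toy data are hypothetical.

```python
def empirical_relative_risk(outcomes, treatment, D):
    """Ratio of joint-success frequencies for the outcomes in D, treated vs. control.

    outcomes: list of dicts {v: 0 or 1}, one per unit (observed outcomes).
    treatment: list of 0/1 assignments W, one per unit.
    Raises ZeroDivisionError if an arm is empty or the control arm has no joint success.
    """
    def joint_success_rate(w):
        arm = [y for y, t in zip(outcomes, treatment) if t == w]
        return sum(all(y[v] == 1 for v in D) for y in arm) / len(arm)

    return joint_success_rate(1) / joint_success_rate(0)


# Hypothetical toy sample: four treated units followed by four control units.
outcomes = [
    {"b": 1, "f": 1}, {"b": 1, "f": 1}, {"b": 1, "f": 0}, {"b": 0, "f": 0},
    {"b": 1, "f": 1}, {"b": 1, "f": 0}, {"b": 0, "f": 1}, {"b": 0, "f": 0},
]
treatment = [1, 1, 1, 1, 0, 0, 0, 0]

rr_b = empirical_relative_risk(outcomes, treatment, {"b"})
rr_bf = empirical_relative_risk(outcomes, treatment, {"b", "f"})
print(rr_b, rr_bf)
```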


2.4 Log-mean linear model 


We assume that each random binary vector $Y_V(w)$ with $w = 0, 1$ follows a Multivariate 
Bernoulli distribution with mean parameter vector $\mu_V(w) = (\mu_D(w))_{D \subseteq V}$, 
with 

$$\mu_D(w) = P(Y_D(w) = 1_D), \quad w = 0, 1, \quad D \subseteq V. \qquad (6)$$

Therefore, any product potential outcome $Y^D(w)$ is a Bernoulli variable with probability 
parameter $\mu_D(w)$, with $D \subseteq V$. From [5], let $\gamma_V(w) = (\gamma_D(w))_{D \subseteq V}$ be the log-mean 
linear parameter vector of the joint distribution of $Y_V(w)$, with 

$$\gamma_D(w) = \sum_{D' \subseteq D} (-1)^{|D \setminus D'|} \log \mu_{D'}(w), \quad w = 0, 1, \quad D \subseteq V. \qquad (7)$$

Following [2] we propose a log-mean linear regression for modelling the distribution 
of the multivariate potential outcome $Y_V(w)$. The resulting model is given by 
a sequence of joint regressions 

$$\gamma_D(w) = \alpha_D + \beta_D w, \quad w = 0, 1, \quad D \subseteq V. \qquad (8)$$

[2] proved that $\log RR_D = \sum_{D' \subseteq D} \beta_{D'}$ and, therefore, the treatment effect in Equation 
(3) is 

$$RR_D = \prod_{D' \subseteq D} \exp(\beta_{D'}), \quad D \subseteq V, \qquad (9)$$


for any product outcome $Y^D \in Y^V$. Furthermore, for any $D \subseteq V$, we define 

$$ICE_D = \prod_{D' \subset D} RR_{D'}^{(-1)^{|D \setminus D'| - 1}} \qquad (10)$$

$$ECE_D = \exp(\beta_D). \qquad (11)$$


Table 1 Honey Study: Estimates of the causal effects (standard errors in parentheses) 

Relative Risk   Estimate        Extrinsic Causal Effect   Estimate        Intrinsic Causal Effect   Estimate
RR_b            1.578 (0.344)
RR_f            1.923 (0.511)
RR_s            1.537 (0.412)
RR_{b,f}        1.914 (0.554)   ECE_{b,f}                 0.631 (0.128)   ICE_{b,f}                 3.034 (1.342)
RR_{b,s}        1.581 (0.453)   ECE_{b,s}                 0.652 (0.134)   ICE_{b,s}                 2.425 (1.086)
RR_{f,s}        1.725 (0.516)   ECE_{f,s}                 0.583 (0.140)   ICE_{f,s}                 2.956 (1.495)
RR_{b,f,s}      1.632 (0.499)   ECE_{b,f,s}               1.459 (0.288)   ICE_{b,f,s}               1.119 (0.304)


The next proposition shows that the decomposition of the causal effect for each product 
outcome $Y^D$ into the intrinsic and the extrinsic component naturally follows. 


Proposition 1. Under the log-mean linear regression approach in Equation (8), for 
any product outcome $Y^D$, 

$$RR_D = ICE_D \times ECE_D, \quad D \subseteq V. \qquad (12)$$

Proof. The proof derives from Lemma 3 and Proposition 1 in [3], marginalizing 
over the set of covariates. In particular, the result follows because 
$ICE_D = \prod_{D' \subset D} \exp(\beta_{D'})$. 


Notice that in case of a single outcome $Y_v$, $ICE_v = 1$ and $RR_v = ECE_v = \exp(\beta_v)$, for any 
$v \in V$. Furthermore, in case $Y_D$ is a subset of independent outcomes, from 
[5] we have that $\beta_D = 0$ and, therefore, $RR_D = ICE_D$, for any $D \subseteq V$. Finally, see 
[3] for an in-depth discussion about intrinsic and extrinsic effects in a wider context 
including also a set of covariates, which allows us to assess the heterogeneity of the 
causal effects. 
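
The decomposition in Proposition 1 can be checked numerically on the point estimates of Table 1. The sketch below is only an illustration: it recovers each $\beta_D$ from the relative risks by inverting Equation (9), using the convention $RR_\emptyset = 1$, and then applies Equation (11) and the expression for $ICE_D$ used in the proof. The function names are assumptions, and the inputs are the rounded table values, so results agree with the table only up to rounding.

```python
from itertools import combinations
from math import exp, log

# Relative-risk point estimates copied from Table 1 (keys are outcome subsets D).
RR = {
    frozenset("b"): 1.578, frozenset("f"): 1.923, frozenset("s"): 1.537,
    frozenset("bf"): 1.914, frozenset("bs"): 1.581, frozenset("fs"): 1.725,
    frozenset("bfs"): 1.632,
}


def nonempty_subsets(D, proper=False):
    """Non-empty subsets of D; with proper=True the set D itself is excluded."""
    top = len(D) - 1 if proper else len(D)
    return [frozenset(c) for r in range(1, top + 1) for c in combinations(sorted(D), r)]


def beta(D):
    """Invert Equation (9): beta_D = sum over D' in D of (-1)^|D minus D'| * log RR_D'.

    The empty subset is skipped because RR of the empty set is 1 (log = 0).
    """
    return sum((-1) ** len(D - Dp) * log(RR[Dp]) for Dp in nonempty_subsets(D))


def ece(D):
    """Extrinsic causal effect, Equation (11)."""
    return exp(beta(D))


def ice(D):
    """Intrinsic causal effect: product of exp(beta_D') over proper subsets D' of D."""
    return exp(sum(beta(Dp) for Dp in nonempty_subsets(D, proper=True)))


for D in sorted(RR, key=lambda d: (len(d), sorted(d))):
    label = "".join(sorted(D))
    print(label, round(ice(D), 3), round(ece(D), 3), round(ice(D) * ece(D), 3))
```

For a pair such as $\{b, f\}$ the intrinsic effect reduces to $RR_b \times RR_f$, which is how the ICE column of Table 1 can be reproduced from the single-outcome relative risks.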


3 The Honey study 


Given the finite set $V = \{b, f, s\}$, let $Y_V = (Y_b, Y_f, Y_s)$ define the vector of the three 
binary outcomes associated respectively with the three cough attributes, Bothersome, 
Frequency and Severity. These outcomes have been properly dichotomized such 
that they take level 1 in case the attribute is absent or low and level 0 otherwise 
(see [3] for details). We are interested in exploring the effect of honey on achieving 
joint successes, that is, the effect on each single outcome $Y_b$, $Y_f$, $Y_s$ and on the four 
resulting product outcomes, shortly denoted as $Y^{bf}$, $Y^{bs}$, $Y^{fs}$ and $Y^{bfs}$. Therefore, 
let $Y^V = (Y^D)_{D \subseteq V, D \neq \emptyset} = (Y_b, Y_f, Y_s, Y^{bf}, Y^{bs}, Y^{fs}, Y^{bfs})$ be the augmented vector of 
outcomes of interest. 

We specified a log-mean linear regression model for the joint distribution of the 
three outcomes $Y_V = (Y_b, Y_f, Y_s)$ related to the cough attributes in the honey data set. 
The estimates of all causal effects with their standard errors are collected in Table 1. 

We get positive estimates of the honey causal effects on each single outcome (all 
estimated relative risks exceed 1), and in particular honey shows the strongest positive 
effect in reducing the frequency of nocturnal cough (estimated relative risk 1.923). 
Moreover, the treatment has a positive effect on every product outcome, which means 
that honey improves the conditions of children with more than one critical cough 
attribute. In particular, the honey treatment is more effective for patterns including 
the frequency attribute. We also notice that for product outcomes the intrinsic effect 
is always stronger than the extrinsic one, except for the largest pattern, which includes 
all the attributes. These results show that the effect of treatment on the dependence 
structure of the outcomes is not negligible and that a multivariate approach to 
causal inference is definitely more suitable than a univariate one. 
Research on the Risk Factors accountable for the 
occurrence of degenerative complications of type 
2 diabetes in Morocco: a prospective study 


Ricerca di fattori di rischio legati all’occorrenza di 
complicanze degenerative del diabete di tipo 2 in 
Marocco: uno studio prospettico 


Badiaa Lyoussi, Zineb Selihi, Mohamed Berraho, Karima El Rhazi, Youness El 
Achhab, Adiba El Marrakchi, Chakib Nejjari 


Abstract 

Aims: Our study aims to determine the risk factors associated with complications of diabetes in 
patients with type 2 diabetes followed in primary care centers in Morocco. 

Methods: We conducted a nested case-control study. Cases were type 2 diabetic patients 
who suffered from a degenerative complication after diabetes diagnosis; controls were type 2 
diabetic patients with no complications of diabetes at the time of inclusion in the cohort. The 
analysis was performed separately for women and men in order to determine the factors 
specific to each sex. 

Results: 732 patients with or without complications were identified. Retinopathy was the most 
frequent complication (41.2%), followed by diabetic neuropathy (28.4%) and cardiovascular 
complications (26.2%). For women, low economic level (ORadj = 11.36, 95% CI 5.59 - 23.25), 
forgetting the treatment (ORadj = 3.42, 95% CI 1.29 - 9.09), urban environment (ORadj = 3.97, 
95% CI 0.04 - 0.17), very high level of stress (ORadj = 2.94, 95% CI 1.00 - 8.63), and overweight 
(ORadj = 2.50, 95% CI 1.12 - 5.53) remained significantly associated with the risk of 
degenerative complications after adjustment. 

For men, in the unadjusted analysis, low socioeconomic level and lack of professional 
activity increased the risk of degenerative complications. Overweight patients 
[5.96 (95% CI: 1.61 - 22.10)], patients with dyslipidemia [3.09 (95% CI: 1.51 - 6.33)] and 
patients treated by a general physician [4.57 (95% CI: 1.24 - 16.82)] were at higher risk of 
degenerative complications. 

Conclusion: These findings suggest that some risk factors of degenerative complications of type 
2 diabetes are strongly linked with the Moroccan context. This study highlighted important 
areas for health care intervention and provided a reminder for vigilance when known risk factors 
for complications are present. 


Key words: Type 2 diabetes, Risk factor, Degenerative complication 


Introduction 


Morocco is currently experiencing a demographic, nutritional and epidemiological 
transition, with a proliferation of cases of diabetes in line with current global 
trends. The degenerative complications of diabetes represent a heavy burden in terms of 
morbidity and mortality, but also in terms of socio-economic impact and cost. Many of 
the complications of diabetes can be avoided or delayed by preventive measures and 
programs to manage the disease. Preventive actions targeting the determinants of these 
complications, however, require that the determinants be known and identified, and 
substantial work on the complications of diabetes and their determinants is still 
lacking. 
This work is part of the study of the degenerative complications of diabetes in 
Moroccan diabetic patients in terms of descriptive and analytical epidemiology, and 
examines the main factors associated with degenerative complications. 


Methodology 


1. Study subjects 

The present study was conducted as a nested case-control study in a cohort of type 2 
diabetes patients. Cases and controls were recruited from diabetic patients 
in the EpiDiaM cohort (Epidemiology Diabetes Morocco). 


2. Case/Control study: 

We recruited 366 cases and 366 controls. 

The case: 

All diabetic patients in our cohort with one or more complications of diabetes 
(macrovascular, nephropathy, neuropathy and retinopathy) formed our target population 
of cases. We excluded patients who had complications before the diagnosis of diabetes 
and patients for whom the date of diagnosis of the complication could not be determined. 

The control: 

All diabetic patients in our cohort with no complications represented our target 
population of controls. Controls were defined as diabetic patients with no 
complications of diabetes at the time of inclusion in the cohort. Controls were matched 
to cases on age (±5 years), sex and diabetes duration (±5 years). We excluded patients 
whose complication status was indeterminate (Figure 1). 


Figure 1: Flow chart of eligibility for the study and inclusion in the analysis 

[Flow chart. Source population: 1196 type II diabetics (EpiDiaM). Excluded: 249 diabetics 
with a complication concomitant with or before the diagnosis of diabetes. Remaining: 947 
diabetics with or without complication after the diagnosis of diabetes. Matching (1 control / 
1 case) on age ± 5 years, duration of diabetes ± 5 years and gender (female / male); subjects 
not eligible for the study are set aside. Eligible subjects: 732, that is 366 controls with 
uncomplicated type II diabetes (control group) and 366 cases with complication of type II 
diabetes (case group).] 


3. Analytic Plan 

The analysis was adjusted for all potential confounders. Sex is known to modify 
the association between risk factors and cardiovascular complications [1 - 2]. 
In addition, some factors are specific to women (such as contraception) and others to men, 
especially in the Moroccan context (tobacco, alcohol ...). For these reasons the 
analysis was performed separately for women and men in order to determine the 
factors specific to each sex. The level of statistical significance was set at P = 0.05. All 
analyses were performed using SPSS version 20. 
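
The adjusted odds ratios reported in this study come from models fitted in SPSS, which are not reproduced here. As a rough illustration of the quantity being reported, the sketch below computes an unadjusted odds ratio with a Wald-type 95% confidence interval from a 2x2 table. The counts are hypothetical, the function name is an assumption, and ignoring both the matching and the adjustment for confounders is a deliberate simplification.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    All four counts must be positive.
    """
    or_hat = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = exp(log(or_hat) - z * se_log)
    upper = exp(log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts, chosen only to show the calculation.
or_hat, lower, upper = odds_ratio_ci(a=20, b=10, c=10, d=20)
print(round(or_hat, 2), round(lower, 2), round(upper, 2))
```

An interval computed this way always contains the point estimate, since it is symmetric around log(OR) on the log scale.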


Results 


We first identified a total of 1,196 diabetic patients from the EpiDiaM cohort study, 
of whom 732 (366 cases with complication(s) and 366 controls) were eligible for the 
study of factors related to degenerative complications of type 2 diabetes. 
The majority of the population was female (77.7%). Mean 
age was 57.5 ± 10.4 years and mean diabetes duration was 8 ± 6.60 years. 41.2% had 
retinopathy and 28.4% had diabetic neuropathy, while 26.2% had cardiovascular 
complications. In the multivariate analysis, a higher risk of degenerative 
complications was observed among women with low economic status (adj. OR = 
11.36, 95% CI 5.59 to 23.25), women who forgot their treatment (adj. OR 
= 3.42, 95% CI 1.29 to 9.09), urban women (adj. OR = 3.97, 95% CI 0.04 to 0.17), 
women with a very high level of stress (adj. OR = 2.94, 95% CI 1.00 to 8.63), and 
overweight women (adj. OR = 2.50, 95% CI 1.12 to 5.53). For men, in the 
unadjusted analysis, low socio-economic level and occupational inactivity increased 
the risk of degenerative complications. Overweight men [5.96 (95% CI: 1.61 to 22.10)], 
men with dyslipidemia [3.09 (95% CI: 1.51 to 6.33)] and patients treated by a general 
practitioner [4.57 (95% CI: 1.24 to 16.82)] had a higher risk of degenerative 
complications. Our results confirm that the risk factors for degenerative complications 
of type 2 diabetes are strongly related to the Moroccan context. 


Discussion 


We conducted a case-control study nested in a cohort to determine the factors 
associated with degenerative complications of type 2 diabetes in Morocco. 

For women, we observed that a low socioeconomic level was associated with a 
significant increase in the risk of degenerative complications. 
Although the mechanism has not been completely clarified, a low socioeconomic 
level may contribute to the development of complications of type 2 diabetes 
through different and complex processes [3]. In the Moroccan context, poor women, like 
those in other studies among low-income women [4-5], often put their families' needs 
and preferences before their own. 

Higher levels of psychosocial stress may also affect a person's socioeconomic 
status, use of medical services and overall health [6-7]. Surwit et al. showed that stress 
management training for one year was associated with a significant reduction of 
HbA1c. However, very anxious patients did not obtain a reduction in HbA1c level [8]. 
Additionally, obesity is associated with increased risks for complications [9-10]. This 
association is maintained after adjustment for other risk factors, with the risk multiplied 
by 2.5 in women. Obesity is an important modifiable risk factor for type 2 diabetes 
[11], cardiovascular disease [12-13] and renal failure [14]. In some regions of 
Morocco, especially the South, the weight of a woman is even seen as a competitive 
advantage increasing her chances of finding a husband [15]. However, our analysis 
did not indicate a significant association between complication risk and physical 
activity. 

An association between area of residence and complication risk has been 
reported in previous epidemiologic studies [16-17]; in our study, the risk of degenerative 
complications is multiplied by 3.9 in women who reside in urban areas. This may be 
due to a sedentary lifestyle as well as reduced access to healthy food in urban 
areas. 

For men, we were not able to carry out a multivariate analysis: owing to the lack of 
statistical power (low number of diabetic patients), the conditions for statistical 
modeling were not satisfied. 


To our knowledge, our case-control study is the first in Morocco to directly analyze 
the relationship between degenerative complications and all the risk factors that may lead 
to the occurrence of these complications among Moroccan diabetics. Our 
study therefore adds important knowledge about the risk factors responsible for the 
occurrence of degenerative complications resulting in hospitalization or death. This study 
highlighted important areas for health care intervention and provided a reminder for 
vigilance when known risk factors for complications are present. 
Bootstrap group penalty for high-dimensional 
regression models 


Una procedura di penalizzazione bootstrap a gruppi per 
modelli di regressione ad alta dimensionalita 


Valentina Mameli, Debora Slanzi and Irene Poli 


Abstract The paper presents a new penalization procedure for variable selection in 
regression models. We propose the Bootstrap Group Penalty (BGP) that extends the 
bootstrap version of the LASSO method by taking into account the grouping struc- 
ture which may be present or introduced in a model. Based on a simulation study we 
demonstrate that the new procedure outperforms some existing group penalization 
methods in terms of both prediction accuracy and variable selection quality. 
Abstract Il presente lavoro propone una nuova procedura per la selezione delle 
variabili in modelli di regressione penalizzati, chiamata penalizzazione bootstrap a 
gruppi (BGP), che estende la versione bootstrap del metodo LASSO tenendo conto 
della struttura di raggruppamento fra i predittori che può essere presente o può es- 
sere introdotta in un modello. Uno studio di simulazione rivela che la nuova proce- 
dura fornisce risultati migliori rispetto ad alcuni metodi di penalizzazione a gruppo 
esistenti. La bontà di questi risultati è misurata sia in termini di accuratezza di pre- 
visione sia in termini di qualità nella selezione delle variabili. 


Key words: Bi-level selection, high-dimensionality, regression models 


1 Introduction 


One of the most challenging problems arising in many scientific contexts is mod- 
elling data characterised by a huge number of variables interacting with each other 
in some complex and unknown pattern. Often, the sample size considered in the 
analysis is small compared to the number of variables. The development of new 
statistical tools tailored to analyse these data is therefore crucial in contemporary 
statistical learning. Many proposals are available in the literature with the main aim 
of reducing the dimensionality and selecting the most relevant variables for the 
problem under study [Fan and Lv, 2010]. An important line of research is concerned 
with penalized regression procedures; see for example [Fan and Lv, 2010, 
Breheny and Huang, 2009, Tibshirani, 1996, Zhang, 2010]. The sparsity condition 
is widely considered under this scenario assuming that only a subset of predictors 
is associated with the response variable. The penalized procedures with the sparsity 
assumption are designed with the aim of both selecting the most relevant predictors 
and estimating the parameters of the model. The penalties in regression models can 
be subdivided into three wide classes which relate to individual variable selection, 
group variable selection and bi-level variable selection. In this paper we propose 
a new penalization procedure which we call the Bootstrap Group Penalty (BGP) 
which is obtained by coupling the properties of group variable and bi-level selec- 
tion methods with the bootstrap re-sampling methods. The approach extends the 
work of Bach (2008) where a bootstrap version of LASSO was proposed. 

The paper proceeds as follows. In Section 2 we review penalized procedures and 
we introduce the Bootstrap Group Penalty procedure. Section 3 investigates 
the performance of our method through a simulation study. In Section 4 we present 
some concluding remarks on the results of the simulation study. 


2 The Bootstrap Group Penalty procedure 


2.1 Model set-up 


We consider the multiple linear regression model 

$$y_i = X_i^T \beta + \varepsilon_i, \quad i = 1, \dots, n \qquad (1)$$

where $X_i = (x_{i1}, \dots, x_{ip})^T$ is a p-dimensional vector of predictors, $y_i$ is the response 
variable, $\varepsilon_i$ is the error term and $\beta = (\beta_1, \dots, \beta_p)^T$ is the regression vector of p 
unknown parameters which must be estimated from the data. It is known that when the 
number of covariates (dimension of the system) considerably exceeds the number 
of observations ($p \gg n$) the model is not identifiable from a statistical perspective 
and therefore not estimable with standard classical statistical procedures. Penalized 
regression procedures, also known as regularized regression methods, are important 
methods frequently used in high dimensional problems as they are able to estimate 
reliable models and to improve on prediction capabilities, even when the number 
of predictors is much larger than the number of observations. In order to estimate 
the vector of regression coefficients $\beta$ we minimize the so-called penalized least 
squares function, defined as 

$$Q(\beta) = \frac{1}{2n} (y - X\beta)^T (y - X\beta) + P(\beta \mid \lambda),$$

where $y = (y_1, \dots, y_n)^T$ and $X = (X_1, \dots, X_n)^T$ is the $n \times p$ design matrix. The function 
$P(\cdot)$ is a penalty on the regression coefficient parameters $\beta$ which controls the 
complexity of the model. The parameter $\lambda$ is a tuning parameter which can be selected 
using cross validation or information criteria like the Akaike and the Bayesian 
information criteria [Akaike, 1974, Schwarz, 1978]. Various penalties have been 
proposed in the literature, see for example [Fan and Lv, 2010] and [Huang et al., 2012]. 
These penalties can be subdivided into three broad classes which relate to individual 
variable selection, group variable selection and bi-level variable selection. There 
is a large literature on penalized regression procedures for individual variable 
selection; the least absolute shrinkage and selection operator (LASSO) proposed by 
Tibshirani (1996) is surely the most widely used and best known procedure. Recently, 
a large number of works have extended these approaches to grouped predictors; 
indeed it is possible to take account of a grouping structure among the predictors in 
order to improve model prediction capabilities ([Ogutu and Piepho, 2014]). When 
a grouping structure is introduced into a model, interest may lie entirely in 
selecting relevant groups and not individual variables, but when both individual 
variables and groups are important we consider bi-level selection procedures, which 
select both the important groups and the important variables within these groups. Under 
this setup, among others, we can cite the following variable selection procedures: the 
group LASSO ([Yuan and Lin, 2006]), the smoothly clipped absolute deviation penalty 
([Fan and Li, 2001]), the minimax concave penalty method ([Zhang, 2010]), the 
composite MCP ([Breheny and Huang, 2009]), the group bridge penalty ([Fu, 1998]) 
and the group exponential LASSO ([Breheny, 2015]). These selection procedures 
have been introduced to overcome some limitations of the LASSO estimator and 
present a number of appealing properties in terms of both estimation accuracy 
and variable selection. In order to improve the performance of these 
procedures in high-dimensional settings, when the number of observations is very 
small, it may be desirable to use a bootstrap sampling technique which perturbs 
the initial dataset in order to gain information from the multiple datasets resulting from 
the bootstrap procedure. This approach was suggested for the LASSO estimator by 
Bach (2008). 
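
To make the role of the penalty term concrete, the sketch below evaluates the penalized least squares function for two of the penalties mentioned above, the LASSO and the group LASSO. It is a minimal illustration only: the function names are hypothetical, the $1/(2n)$ scaling of the loss and the square-root-of-group-size weights are common conventions assumed here, and the toy data are invented.

```python
from math import sqrt


def penalized_least_squares(y, X, beta, penalty):
    """Q(beta) = (1 / 2n) * ||y - X beta||^2 + penalty(beta)."""
    n = len(y)
    residuals = [yi - sum(x * b for x, b in zip(row, beta)) for yi, row in zip(y, X)]
    return sum(r * r for r in residuals) / (2 * n) + penalty(beta)


def lasso_penalty(lam):
    """Individual-variable penalty: lam * sum_j |beta_j|."""
    return lambda beta: lam * sum(abs(b) for b in beta)


def group_lasso_penalty(lam, groups):
    """Group penalty: lam * sum_g sqrt(size of g) * Euclidean norm of beta within g.

    groups is a list of index lists that partition the coefficients.
    """
    def pen(beta):
        return lam * sum(sqrt(len(g)) * sqrt(sum(beta[j] ** 2 for j in g)) for g in groups)
    return pen


# Hypothetical toy data: n = 2 observations, p = 4 predictors in two groups of two.
y = [1.0, 2.0]
X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
beta = [1.0, 0.0, 0.0, 0.0]

q_lasso = penalized_least_squares(y, X, beta, lasso_penalty(0.5))
q_group = penalized_least_squares(y, X, beta, group_lasso_penalty(0.5, [[0, 1], [2, 3]]))
print(q_lasso, q_group)
```

The group penalty charges a whole group as soon as one of its coefficients is non-zero, which is what drives selection at the group level rather than at the level of single variables.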


2.2 The Bootstrap Group Penalty (BGP) 


We introduce a new penalized procedure which is obtained by coupling the proper- 
ties of group variable and bi-level selection methods with the bootstrap re-sampling 
methods. 

If the training data consist of n observations $(X_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$, $i = 1, \dots, n$, we 
consider B bootstrap replications of the n pairs $(X_i, y_i)$. More specifically, for 
$b = 1, \dots, B$, we consider $(X_{bi}, y_{bi}) \in \mathbb{R}^p \times \mathbb{R}$ sampled uniformly at random with 
replacement from the original training pairs $(X_i, y_i)$, $i = 1, \dots, n$. Then at each 
bootstrap iteration we estimate the regression parameters $\beta_j$, for $j = 1, \dots, p$, by using 
a penalized group or bi-level group selection procedure. At each bootstrap step we 
select the subset of covariates with non-zero coefficients under the penalized 
procedure, specified by the set of covariate indices: 

$$J_b = \{ j : \hat{\beta}_j^b \neq 0, \ j = 1, \dots, p \}, \quad b = 1, \dots, B.$$

Among all the B sets $J_b$, we retain only the predictors which are selected with high 
frequency across the B bootstrap replications. The proportion of predictors to be taken into 
account depends strongly on the penalization method. The set of selected variables 
is then used to estimate the model using a penalized group procedure. 
The procedure is summarized in Algorithm 1. 


Algorithm 1 BGP procedure 

Input: n sample size, B number of bootstrap replicates, training pairs (X_i, y_i) ∈ R^p × R, 
    i = 1,...,n, T threshold value (in %) to select relevant covariates in the bootstrap 
    replications. 
for b = 1 to B do 
    Generate the b-th bootstrap sample (X_bi, y_bi) ∈ R^p × R, i = 1,...,n, by sampling 
        with replacement from the training pairs. 
    Compute the estimates β̂_j^b of the regression parameters, for j = 1,...,p, by using 
        a penalized procedure. 
    Compute J_b = { j | β̂_j^b ≠ 0, j = 1,...,p }. 
end for 
Compute J = { j | j is present in at least T% of the sets J_b }. 
Compute β̂ by using a penalized procedure on the restricted set J of covariates. 
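
The steps of Algorithm 1 can be sketched as follows. This is only an illustration under explicit assumptions: a plain LASSO fitted by coordinate descent stands in for the group and bi-level penalties used in the paper, the function names, default values and toy data are hypothetical, and the code is written in pure Python for readability rather than speed.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Plain LASSO (stand-in penalty) minimizing (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta, resid = [0.0] * p, list(y)
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(x * x for x in col) / n
            if norm == 0.0:
                continue
            # Correlation of column j with the partial residual (column j added back).
            rho = sum(x * (r + x * beta[j]) for x, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / norm
            resid = [r - x * (new - beta[j]) for x, r in zip(col, resid)]
            beta[j] = new
    return beta


def frequent_indices(supports, threshold_pct):
    """Indices present in at least threshold_pct % of the bootstrap supports J_b."""
    counts = {}
    for J in supports:
        for j in J:
            counts[j] = counts.get(j, 0) + 1
    return sorted(j for j, c in counts.items() if 100.0 * c / len(supports) >= threshold_pct)


def bgp(X, y, lam, B=20, threshold_pct=90.0, seed=0):
    """Bootstrap selection followed by a refit on the retained covariates."""
    rng = random.Random(seed)
    n = len(y)
    supports = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n pairs drawn with replacement
        beta_b = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        supports.append({j for j, b in enumerate(beta_b) if b != 0.0})
    J = frequent_indices(supports, threshold_pct)
    beta_J = lasso_cd([[row[j] for j in J] for row in X], y, lam) if J else []
    return J, dict(zip(J, beta_J))


# Hypothetical toy data: only the first of six predictors is relevant.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1) for _ in range(6)] for _ in range(30)]
y = [3.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]
J, coef = bgp(X, y, lam=0.05)
print(J, coef)
```

Replacing `lasso_cd` with a group or bi-level penalized fit would bring the sketch closer to the procedure described above; the bootstrap loop and the frequency threshold would stay the same.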


3 A simulation study 


In this section, we perform a simulation study to evaluate some group penalization 
methods and their corresponding bootstrap counterparts. Among the several 
penalties that can be used here we focus on the group bridge (gBridge), the composite 
MCP (cMCP), the minimax concave penalty (MCP), the group exponential 
LASSO (gel) and the group LASSO (gLASSO), and on their bootstrap counterparts, 
which will be referred to as BgBridge, BcMCP, BMCP, Bgel and BgLASSO, 
respectively. The simulation is based on the linear regression model $y_i = X_i^T \beta + \varepsilon_i$, 
$i = 1, \dots, n$, where $\varepsilon_i \sim N(0, \sigma^2)$, as introduced in Equation (1). The standard 
deviation $\sigma$ is assumed to be 3 and covariates were generated from the normal distribution 
as in the simulation study proposed by Breheny (2015). We consider the following 
setup: 10 groups with 20 variables in each group, so that the total number of 
predictors is p = 200. The number of non-zero coefficients is 4, and all the non-zero 
coefficients belong to the same group. We randomly split the data into training and 
testing datasets, where the training set is approximately 50% of the full data 
set. The data set size considered in this simulation is n = 100. Regarding the 
bootstrap samples, we used B = 500. To evaluate the performance of the various group 
penalization methods with respect to their bootstrap counterparts we calculate some 
measures of prediction accuracy and variable selection efficiency. In particular we 
repeat the simulation 1000 times and we compute the predictive mean square error, 
the sensitivity (the ratio between the number of selected important variables and the 
number of important variables), and the specificity (the ratio between the number of 
removed unimportant variables and the number of unimportant variables) as defined 
in Geng et al. (2015). The results of this simulation are presented in Table 1. 
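
The three evaluation measures can be written down directly from the definitions above. In the sketch below the function names and the example selection are hypothetical, and the predictive mean square error is taken to be the average squared prediction error on the test set, which is an assumption since the text does not spell out the formula.

```python
def sensitivity(selected, important):
    """Share of the important variables that were selected."""
    important = set(important)
    return len(set(selected) & important) / len(important)


def specificity(selected, important, p):
    """Share of the unimportant variables (among p predictors) that were removed."""
    unimportant = set(range(p)) - set(important)
    removed = unimportant - set(selected)
    return len(removed) / len(unimportant)


def pmse(y_true, y_pred):
    """Average squared prediction error over the test observations."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Setup as in the simulation: p = 200 predictors, 4 non-zero coefficients.
p = 200
important = [0, 1, 2, 3]
selected = [0, 1, 2, 10, 11]  # hypothetical output of a selection procedure

sens = sensitivity(selected, important)
spec = specificity(selected, important, p)
print(sens, spec, pmse([1.0, 2.0], [1.0, 4.0]))
```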


Table 1 Simulation results over 1000 replicates: Predictive Mean Square Error (PMSE), 
Sensitivity, Specificity. The statistically significant differences between the original 
approach and the bootstrap counterpart are reported in bold. 


Method      PMSE             Sensitivity       Specificity 

gBridge     0.885 (0.176)    0.934 (0.122)     0.843 (0.025) 
BgBridge    0.835 (0.293)    1.000 (0.000)     0.907 (0.031) 
cMCP        0.936 (0.207)    0.797 (0.190)     0.849 (0.014) 
BcMCP       0.839 (0.150)    0.910 (0.136)     0.913 (0.014) 
MCP         1.281 (0.257)    1.000 (0.000)     0.489 (0.077) 
BMCP        1.038 (0.911)    1.000 (0.000)     0.899 (0.040) 
gel         0.949 (0.185)    1.000 (0.000)     0.462 (0.083) 
Bgel        0.755 (0.608)    0.9998 (0.008)    0.937 (0.044) 
gLASSO      0.924 (0.221)    0.554 (0.497)     0.916 (0.158) 
BgLASSO     1.293 (0.544)    1.000 (0.000)     0.652 (0.104) 


Table 1 shows that the Bootstrap Group Penalty procedure improves the PMSE and 
the Specificity for almost all the compared approaches, as highlighted in bold in 
Table 1 (see, for example, the very good predictive performance of Bgel with respect 
to all the other approaches using different penalization methods, which is confirmed 
by the Specificity index). Moreover, when considering Sensitivity, all the BGP 
procedures increase, or at least preserve, the very good performance of the original 
methods, achieving values of Sensitivity close to 1. 


4 Concluding remarks 


Our results suggest that combining penalized group and bi-level variable selection 
methods with re-sampling methods is a promising approach to handle high-dimensional 
problems. The method could be easily adapted to handle other penalization 
procedures. Further work will be devoted to investigating the method in a real 
case study and in other simulation set-ups. 




References 


[Akaike, 1974] Akaike, H. (1973). Information Theory and an Extension of the Maximum Like- 
lihood Principle. In: B.N. Petrov and F. Csaki (eds.) 2nd International Symposium on Infor- 
mation Theory: 267-81 Budapest: Akademiai Kiado. 

[Bach, 2008] Bach, F.R. (2008). Bolasso: Model Consistent Lasso Estimation through the Boot- 
strap. Proceedings of the 25-th International Conference on Machine Learning, Helsinki, Fin- 
land, 2008. 

[Breheny and Huang, 2009] Breheny, P., Huang, J. (2009). Penalized methods for bi-level variable 
selection. Statistics and Its Interface, 2(3), 369-380. 

[Breheny, 2015] Breheny, P. (2015). The Group Exponential Lasso for Bi-Level Variable Selec- 
tion. Biometrics, 71, 731-740. 

[Fan and Li, 2001] Fan, J. and Li, R. (2001). Variable selection via non-concave penalized 
likelihood and its oracle properties. Journal of the American Statistical Association, 96, 
1348-1360. 

[Fan and Lv, 2010] Fan, J., Lv, J. (2010) A Selective Overview of Variable Selection in High 
Dimensional Feature Space. Statistica Sinica, 20, 101-148. 

[Fu, 1998] Fu, W. J. (1998). Penalized Regressions: The Bridge versus the Lasso. Journal of Com- 
putational and Graphical Statistics, 7(3), 397-416. 

[Geng et al., 2015] Geng, Z., Wang, S., Yu, M., Monahan, P.O., Champion, V., Wahba, G. 
(2015). Group variable selection via convex log-exp-sum penalty with application to a breast 
cancer survivor study. Biometrics, 71(1), 53-62. 

[Huang et al., 2009] Huang, J., Ma, S., Xie, H., and Zhang, C. (2009). A group bridge approach 
for variable selection. Biometrika, 96, 339-355. 

[Huang et al., 2012] Huang, J., Breheny, P., Ma, S. (2012) A Selective Review of Group Selection 
in High-Dimensional Models. Statistical Science, 27(4), 481-499. 

[Lee et al., 2012] Lee, Y.K., Lee, E.R., Park, B.U. (2012). Principal component analysis in very 
high-dimensional spaces. Statistica Sinica 22, 933-956. 

[Liu et al., 2015] Liu, J., Wang, F., Gao, X., Zhang, H., Wan, X., Yang, C. (2015) A Penalized 
Regression Approach for Integrative Analysis in Genome-Wide Association Studies. Journal 
of Biometrics and Biostatistics, 6(228), 7 pages. 

[Ogutu and Piepho, 2014] Ogutu, J.O., Piepho, H.-P. (2014) Regularized group regression meth- 
ods for genomic prediction: Bridge, MCP, SCAD, group bridge, group lasso, sparse group 
lasso, group MCP and group SCAD. BMC Proceedings, 8(Suppl. 5), S7. 

[Park et al., 2007] Park, M. Y., Hastie, T., Tibshirani, R. (2007). Averaged gene expressions for
regression. Biostatistics, 212-227.

[Tibshirani, 1996] Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal 
of the Royal Statistical Society, Series B, 58(1), 267-288. 

[Sharma et al., 2014] Sharma, D.B., Bondell H.D., Zhang, H.H. (2013). Consistent Group Identi- 
fication and Variable Selection in Regression with Correlated Predictors. Journal of Compu- 
tational and Graphical Statistics, 22(2), 319-340. 

[Schwarz, 1978] Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics, 
6, 461-464. 

[Yuan and Lin, 2006] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression 
with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49-67. 

[Zhang, 2010] Zhang, C.-H. (2010) Nearly unbiased variable selection under minimax concave 
penalty. Annals of Statistics, 38(2), 894-942. 


Improving small area estimates of households’ 
share of food consumption expenditure in Italy 
by means of Twitter data 


Improving the precision of small area estimates of the
share of food expenditure using data from the
social network Twitter


Marchetti S., Pratesi M., Giusti C. 


Abstract In this work we use emotional data from Twitter as an auxiliary variable
in a small area model to estimate Italian households’ share of food consumption
expenditure (the proportion of food consumption expenditure in total consumption
expenditure) at the provincial level. We show that Twitter data have potential in
predicting our target variable, reducing the estimated mean squared error relative
to that obtained from the same working model without the Twitter data.

Abstract This work shows how data derived from Twitter can improve the efficiency
of small area estimates of the share of food expenditure at the provincial level
in Italy.


Key words: Big Data, Area level model, Emotional data 


1 Introduction 


Recently, an increasing number of researchers have investigated the value of using
big data (huge amounts of digital information about human activities) in
socio-economic studies; see for example Eagle et al (2010); Blumenstock et al (2015);
Decuyper et al (2014). Marchetti et al (2015) suggested three approaches to using big
data in synergy with small area estimation methods. Another approach to using big
data in small area estimation was suggested by Porter et al (2014).

In this paper we focus on the use of data coming from the social network Twitter 
to investigate their potential in predicting the share of food consumption expenditure 
of Italian households at the province level. The paper has the following structure: the 
description of the data used in the analysis is in section 2; the small area estimation 
model is presented in section 3; the results of the application are detailed in section
4. Finally, we draw some concluding remarks in section 5. 


2 Data used in the application 


The primary source of data on households’ expenditure in Italy is the Household
Budget Survey (HBS), carried out annually by ISTAT. In 2012 the HBS sample
comprised approximately 28,000 households. Data were collected using a two-stage
sample design in which municipalities were the first-stage units and households
the second-stage units. The regions (NUTS 2 level according to Eurostat) are the
finest geographical level for which direct estimates of the target indicators are
reliable. However, measures of households’ living conditions and well-being at a
more detailed geographical level are often crucial, since they can, for example,
help policy makers plan local policies aimed at reducing poverty and social
exclusion (Giusti et al, 2016).

Households’ consumption expenditure can be classified into food (and beverages)
and non-food expenditure. The share of total expenditure that a household devotes
to food items is an important indicator of the household’s living conditions:
households at risk of poverty usually spend a higher share of their total
expenditure on food than other households, and a correspondingly lower share on
other resources and commodities.

To estimate the target at the province level we resort to model-based area-level
small area methods, since direct estimates are unreliable. As sources of the
auxiliary variables needed in model-based estimation, we use data from the
Population and Housing Census 2011 and from the Survey¹ on Social Actions
and Services on Single and Associated Municipalities 2012.

From the Population Census we collected information at provincial level such as
the number of households, the average household size, the tenure status and the
share of female-headed households. As the target variable of our analysis can be
considered a proxy of households’ living conditions, we also considered as a
valuable source of auxiliary information the expenditure that Italian
municipalities made in 2012 on social protection interventions. These data include
cost information on local welfare policies, such as services, benefits and
transfers directed to households with children, old-age persons, poor and socially
excluded persons, and immigrants.

¹ This survey is a census survey, although some nonresponse can occur. Here we
ignore the nonresponse and use these data as census data.

Besides these sources of official statistics, we also considered big data from
Twitter as a potential source of auxiliary information. In particular, we
considered as a potential covariate for our small area working models the iHappy
indicator for the year 2012. The iHappy indicator has been made available every
year since 2012 for all 110 Italian provinces on the Opinion Analytics platform
Voices from the Blogs. The 2012 iHappy indicator was computed by collecting and
coding more than 43 million tweets posted on a daily basis in all the Italian
provinces. The words and emoticons of the tweets were classified, using a training
set, into two categories, “happy” and “unhappy”, together with a residual class
“other”. Then, Curini et al (2015) derived the frequency distribution of the happy
and unhappy tweets in the entire population. The iHappy indicator was then
computed for each Italian province as the percentage ratio of the number of happy
tweets to the sum of happy and unhappy tweets. The overall average of the iHappy
indicator in 2012 was 44.5%, with a minimum of 35.1% for Oristano and a maximum
of 56.6% for Sassari, both provinces of the Sardinia region. Indeed, the spatial
variability of the iHappy values was rather high, as is evident from the
“emotional map” of Figure 1 (right).
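The computation of the indicator can be sketched in a few lines. This is only an illustration: the function name `ihappy` and the tweet counts are hypothetical and are not taken from Curini et al (2015). The formula is the percentage ratio of happy tweets to the sum of happy and unhappy tweets, so tweets in the residual class “other” do not enter the calculation.

```python
def ihappy(happy: int, unhappy: int) -> float:
    """Percentage of happy tweets among tweets classified as happy or unhappy."""
    classified = happy + unhappy
    if classified == 0:
        raise ValueError("no tweets classified as happy or unhappy")
    return 100.0 * happy / classified


# Hypothetical counts per province: (happy, unhappy, other).
counts = {
    "Province A": (445, 555, 1200),
    "Province B": (351, 649, 800),
}

for province, (happy, unhappy, _other) in counts.items():
    # The residual class "other" is ignored by construction.
    print(province, round(ihappy(happy, unhappy), 1))
```

Because the denominator excludes the residual class, two provinces with very different tweet volumes can still be compared on the same percentage scale.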


3 Short review of the Fay-Herriot model for small area 
estimation 


Data obtained from surveys are often used to estimate characteristics of subsets
of the survey population. If the sample from a subset is small, a traditional
design-based survey estimator can have unacceptably large variance. Such subsets
are called small areas (Rao and Molina, 2015).

In this study the available data allow us to rely only on area-level models, which
relate small area direct estimates to area-specific auxiliary variables. In
addition, we do not have time-series data and the spatial correlation of the target
direct estimates is low. Our choice therefore falls on the Fay and Herriot (1979)
estimator (FH). A summary description of the method follows.

Let $m$ be the number of small areas and $\theta_i$ the target parameter of area
$i$ (a mean or a proportion). A survey provides a direct estimator
$\hat\theta_i^{dir}$ of $\theta_i$, with $E[\hat\theta_i^{dir}] = \theta_i$ under
the sampling design. A $p$-vector $X_i$ contains the auxiliary data, assumed
exactly known, on population characteristics of area $i$. The FH model is as
follows:

$$\hat\theta_i^{dir} = X_i^T \beta + u_i + e_i, \quad i = 1, \dots, m, \tag{1}$$

where $u_i \overset{iid}{\sim} N(0, \sigma_u^2)$, $i = 1, \dots, m$, are the model
errors and $e_i \overset{ind}{\sim} N(0, \psi_i^2)$, $i = 1, \dots, m$, are the
design errors, with $e_i$ independent of $u_j$ for all $i$ and $j$. It is assumed
that the quantity of interest in area $i$ is $\theta_i = X_i^T \beta + u_i$.

Under the assumption of normality of both errors (model and sampling design), the
best linear unbiased predictor (BLUP) of $\theta_i$ is
$\tilde\theta_i^{FH} = \gamma_i \hat\theta_i^{dir} + (1 - \gamma_i) X_i^T \tilde\beta$,
with $\gamma_i = \sigma_u^2 / (\sigma_u^2 + \psi_i^2)$, where $\tilde\beta$ is the
best linear unbiased estimator of $\beta$. According to the theory of small area
estimation (Rao and Molina, 2015), the parameters $\beta$ and $\sigma_u^2$ are
unknown and must be estimated, while $\psi_i^2$ is assumed to be known.

Estimators of $\beta$ and $\sigma_u^2$ can be obtained by restricted maximum
likelihood from the marginal distribution
$\hat\theta_i^{dir} \sim N(X_i^T \beta, \sigma_u^2 + \psi_i^2)$. Plugging the
estimates $\hat\beta$ and $\hat\sigma_u^2$ into the BLUP gives the empirical best
linear unbiased predictor

$$\hat\theta_i^{FH} = \hat\gamma_i \hat\theta_i^{dir} + (1 - \hat\gamma_i) X_i^T \hat\beta, \qquad \hat\gamma_i = \frac{\hat\sigma_u^2}{\hat\sigma_u^2 + \psi_i^2}. \tag{2}$$


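The estimator (2) has
$MSE(\hat\theta_i^{FH}) = \gamma_i \psi_i^2 + (1 - \gamma_i)^2 X_i^T V(\hat\beta) X_i + \psi_i^4 (\psi_i^2 + \sigma_u^2)^{-3} V(\hat\sigma_u^2) = g_{1i} + g_{2i} + g_{3i}$,
where $g_{2i}$ is the contribution to the MSE from estimating $\beta$ and $g_{3i}$
is the contribution to the MSE from estimating $\sigma_u^2$; $V(\hat\beta)$ and
$V(\hat\sigma_u^2)$ are the asymptotic variances of an estimator $\hat\beta$ of
$\beta$ and an estimator $\hat\sigma_u^2$ of $\sigma_u^2$, respectively. An
estimator of the MSE is as follows:

$$mse(\hat\theta_i^{FH}) = \hat g_{1i} + \hat g_{2i} + 2 \hat g_{3i}, \tag{3}$$

where $\hat g_{1i} = \hat\gamma_i \psi_i^2$,
$\hat g_{2i} = (1 - \hat\gamma_i)^2 X_i^T \left[ \sum_{j=1}^m X_j X_j^T / (\psi_j^2 + \hat\sigma_u^2) \right]^{-1} X_i$ and
$\hat g_{3i} = \psi_i^4 (\psi_i^2 + \hat\sigma_u^2)^{-3} \, 2 \left[ \sum_{j=1}^m 1 / (\hat\sigma_u^2 + \psi_j^2)^2 \right]^{-1}$.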
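The predictor (2) and the mse estimator (3) can be sketched as follows for a given estimate of the model variance. This is a minimal illustration under stated assumptions, not the authors' implementation: the REML step that produces the model variance estimate is assumed to have been carried out elsewhere and its result is passed in as `sigma2_u`; `psi2` holds the known sampling variances of the direct estimates; the regression coefficients are computed by weighted least squares with weights `1 / (sigma2_u + psi2[i])`; the helper names `solve` and `fh_eblup` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fh_eblup(theta_dir, X, psi2, sigma2_u):
    """Return (beta_hat, [(eblup_i, mse_i), ...]) for the area-level FH model.

    theta_dir: direct estimates; X: rows of auxiliary variables (include a
    leading 1.0 for the intercept); psi2: known sampling variances;
    sigma2_u: an estimate of the model variance, assumed given.
    """
    m, p = len(X), len(X[0])
    v = [sigma2_u + d for d in psi2]  # marginal variances
    # A = sum_i X_i X_i^T / v_i and c = sum_i X_i theta_i / v_i
    A = [[sum(X[i][j] * X[i][k] / v[i] for i in range(m)) for k in range(p)]
         for j in range(p)]
    c = [sum(X[i][j] * theta_dir[i] / v[i] for i in range(m)) for j in range(p)]
    beta = solve(A, c)
    # Asymptotic variance of the model variance estimator used in g3
    var_sigma2 = 2.0 / sum(1.0 / vi ** 2 for vi in v)
    results = []
    for i in range(m):
        gamma = sigma2_u / v[i]
        synthetic = sum(x * b for x, b in zip(X[i], beta))
        eblup = gamma * theta_dir[i] + (1.0 - gamma) * synthetic
        g1 = gamma * psi2[i]
        a_inv_x = solve(A, X[i])  # A^{-1} X_i
        g2 = (1.0 - gamma) ** 2 * sum(x * a for x, a in zip(X[i], a_inv_x))
        g3 = psi2[i] ** 2 / v[i] ** 3 * var_sigma2
        results.append((eblup, g1 + g2 + 2.0 * g3))
    return beta, results
```

The shrinkage factor pulls each direct estimate towards the regression-synthetic value: areas with a large sampling variance receive a small weight on the direct estimate, and the reverse holds for well-sampled areas.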


4 Area-level small area model with and without Twitter data to 
estimate the share of food consumption expenditure in the 
Italian provinces 


In this section we show that the use of Twitter data can improve the precision of
the Share of Food Consumption Expenditure (SFCE) estimates for the Italian
provinces obtained using small area methods.

First, we estimated the SFCE at provincial level using the FH model (1), selecting
the most predictive variables among the data described in section 2 without
considering the iHappy variable, the one computed from 2012 Twitter data. In this
way we obtained a reduction in MSE, relative to the direct estimates, in all the
provinces. Second, we added the iHappy variable to the other auxiliary variables
and estimated the SFCE again. If the iHappy variable is linearly correlated with
the SFCE and this relation is not already explained by the other auxiliary
variables, then we expect a better performance in terms of MSE when using iHappy.
The results obtained support this expectation.

The target variable, the SFCE, was obtained from the HBS 2012 survey as the ratio
between the consumption expenditure for food (including beverages) and the total
consumption expenditure. Its direct estimate at provincial level,
$\hat\theta_i^{dir}$, was obtained using the Horvitz and Thompson (1952) expansion
estimator.
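One possible reading of the direct estimator is a ratio of Horvitz-Thompson estimated totals, with each sampled household weighted by its survey weight. The sketch below assumes this reading, which the text does not spell out; the function name, the weights and the expenditure figures are hypothetical and are not HBS data.

```python
def direct_share(food, total, weights):
    """Ratio of Horvitz-Thompson estimated totals: sum(w*food) / sum(w*total)."""
    food_total = sum(w * f for w, f in zip(weights, food))
    expenditure_total = sum(w * t for w, t in zip(weights, total))
    return food_total / expenditure_total


# Hypothetical sample of three households in one province.
food = [400.0, 550.0, 300.0]       # food and beverage expenditure
total = [2000.0, 2200.0, 1000.0]   # total consumption expenditure
weights = [120.0, 80.0, 200.0]     # survey (expansion) weights

share = direct_share(food, total, weights)
print(f"direct SFCE estimate: {100 * share:.2f}%")
```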

In 2012 there were 110 Italian provinces in total. However, no HBS sample data
were available in 2012 for the province of Enna (Sicily), so it was not possible
to obtain a direct estimate for this province. Since the auxiliary data for Enna
are known, we computed a synthetic estimate instead.

The auxiliary variables selected for the model without the iHappy variable are:
the share of home owners ($x_1$), the share of households led by a female ($x_2$),
and the per-household local government expenditure in support of several
categories of citizens: households with children ($x_3$), old-aged persons
($x_4$), immigrants ($x_5$), persons at risk of poverty ($x_6$) and services to
families ($x_7$). So let
$X_i = [x_{1i}, x_{2i}, x_{3i}, x_{4i}, x_{5i}, x_{6i}, x_{7i}]^T$ be the design
$p$-vector for model (1) for area $i$, where $x_{ki}$, $k = 0, \dots, p = 7$,
$i = 1, \dots, m$, is the value of the $k$th auxiliary variable in area $i$ (with
$x_{0i} = 1$).

The FH model without the iHappy variable is then
$\hat\theta_i^{dir} = X_i^T \beta + u_i + e_i$. Estimates of $\beta$ and
$\sigma_u^2$ were obtained under the Normality assumptions made in section 3 using
restricted maximum likelihood (REML). From the analysis of
$\hat u_i = \hat\gamma_i (\hat\theta_i^{dir} - X_i^T \hat\beta)$, the Normality
assumption seems reasonable: the Shapiro and Wilk (1965) Normality test statistic
is equal to 0.978, with a p-value of 0.063.

To check the hypothesis that big data, used as auxiliary variables, can help to
increase the precision of small area estimates, we added the iHappy variable
($x_8$), obtained from the analysis of Twitter data as explained in section 2, to
the set of selected auxiliary variables ($x_1, x_2, \dots, x_7$). Let
$Z_i = [X_i^T, x_{8i}]^T$, where $x_{8i}$ is the iHappy value for area $i$. The FH
model is $\hat\theta_i^{dir} = Z_i^T \beta^{BD} + u_i^{BD} + e_i^{BD}$, where the
superscript $BD$ refers to parameters under the model that makes use of big data
(the Twitter data). Point and mse estimates are then obtained according to the
methodology described in section 3 (replacing $X_i$ by $Z_i$).

In both models, with and without the iHappy variable, we selected the auxiliary
variables using a step-wise procedure based on AIC (Hastie and Pregibon, 1992).
The selected variables show a negative linear correlation with the target, ranging
from -0.130 to -0.509. The negative correlations were expected for all the
variables except the share of households led by a female. In general, in Italy,
the share of households led by a female is positively correlated with poverty
indexes and deprivation variables. However, we can suppose that households led by
a female are associated with a smaller household size, so that expenditure on food
and beverages decreases, which lowers the SFCE. This hypothesis is supported by a
linear correlation of -0.857 between the share of households led by a female and
the household size. As for the model without the iHappy variable, we estimated
$\beta^{BD}$ and $\sigma_u^{2,BD}$ under the Normality assumptions made in section
3 using REML. The Shapiro and Wilk (1965) Normality test statistic for
$\hat u_i^{BD}$ is equal to 0.980, with a p-value of 0.107.

The regression parameters estimated for both models, with and without iHappy, are
shown in table 1. The $\hat\beta$s obtained under the two models are similar:
introducing the iHappy variable in the FH model does not change the model
significantly, it just adds predictive power to it. The parameter $\sigma_u$ is
estimated at 0.020 for the model without iHappy and at 0.019 for the model with
iHappy. To test the null hypothesis that $\sigma_u^2 = 0$, we used the test
proposed by Datta et al (2011), and we reject the null hypothesis for both models.


Table 1: Regression parameters of the FH model with/without the iHappy variable.


                                          With iHappy                    Without iHappy
                                          $\hat\beta^{BD}$  p-value$^{BD}$  $\hat\beta$   p-value
Intercept                                  0.7165           0.0000          0.6446        0.0000
iHappy2012                                -0.0019           0.0067          -             -
Share of owners of the house              -0.0038           0.0000         -0.0039        0.0000
Share of households led by a female       -0.3164           0.0009         -0.3222        0.0012
Expenses for households with children     -0.0001           0.2121         -0.0002        0.0513
Expenses for old-aged persons             -0.0001           0.0123         -0.0001        0.0280
Expenses for immigrants                   -0.0013           0.0003         -0.0013        0.0009
Expenses for at risk of poverty persons    0.0006           0.0009          0.0007        0.0006
Expenses for services to families         -0.0005           0.0460         -0.0006        0.0246

It is important to highlight that the iHappy indicator is based on self-selected
data, the Twitter data. In this application, however, we are not able to treat the
self-selection bias, due to lack of information; we therefore assume that the
self-selection is negligible. Moreover, the iHappy indicator can be affected by
measurement error, since not every happy tweet corresponds to a happy person. In
our application the MSE of the iHappy indicator is very small, due to the very
large sample size (43 million tweets); therefore a model that accounts for the
measurement error, such as the one proposed by Ybarra and Lohr (2008),
approximately corresponds to the traditional FH model.

Results on the SFCE estimates are summarized in table 2. Using the FH estimator
(2) with the set of auxiliary variables $X_i$, the rmse is reduced in all the
provinces relative to the direct estimator. The average reduction is about 30%,
and in 25% of the provinces the reduction is at least about 40% (table 2).
Moreover, adding the iHappy variable reduces the rmse further, with an average
gain of 2%. A clearer picture of the gain in precision due to the introduction of
the iHappy variable in the FH model can be seen in the last line of table 2, which
shows the efficiency of $\hat\theta_i^{FH,BD}$ against $\hat\theta_i^{FH}$. There
is a gain in all areas but one, where we observe a loss of 0.5%. The gain goes
from about 2% up to about 7%. Given that the small area estimates obtained without
the iHappy variable already show a remarkable reduction of mse, the further
reduction due to introducing the iHappy variable in the model is a very good
result. This is particularly important because the iHappy variable can be computed
every year, while updated census information on the population is not always
available.
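The efficiency measure reported in the last rows of table 2 is a ratio of rmse values expressed as a percentage, so values below 100 indicate a gain for the estimator in the numerator. A minimal sketch, with a hypothetical function name and made-up rmse values rather than the paper's estimates:

```python
def relative_rmse(rmse_new, rmse_ref):
    """rmse of the new estimator as a percentage of the reference rmse, by area."""
    return [100.0 * new / ref for new, ref in zip(rmse_new, rmse_ref)]


# Hypothetical rmse values for three areas.
rmse_fh = [2.0, 4.0, 8.0]       # FH model without iHappy
rmse_fh_bd = [1.9, 4.0, 7.5]    # FH model with iHappy

ratios = relative_rmse(rmse_fh_bd, rmse_fh)
gains = [100.0 - r for r in ratios]  # percentage reduction in rmse
print([round(r, 2) for r in ratios], [round(g, 2) for g in gains])
```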


Table 2: Summary of point estimates of SFCE and their efficiency.


                                                     Min.   1st Qu.  Median  Mean   3rd Qu.  Max.

$\hat\theta^{dir}$ (%)                               15.38  19.44    21.34   22.45  25.56    35.42
$\hat\theta^{FH}$ (%)                                15.37  19.70    21.60   22.19  24.47    29.91
$\hat\theta^{FH,BD}$ (%)                             15.44  19.68    21.64   22.17  24.65    29.55

rmse($\hat\theta^{FH}$)/rmse($\hat\theta^{dir}$) (%)     19.79  60.66    74.90   70.38  82.16    99.39
rmse($\hat\theta^{FH,BD}$)/rmse($\hat\theta^{dir}$) (%)  18.44  58.29    72.37   68.49  80.35    99.43
rmse($\hat\theta^{FH,BD}$)/rmse($\hat\theta^{FH}$) (%)   93.18  95.73    97.33   97.02  98.22    100.50


To obtain a clearer picture of the estimates across the country, we mapped them in
figure 1. In the same figure we contrast our estimates with the map of the iHappy
variable to show the relationship between the two variables. The SFCE point
estimate for the out-of-sample province of Enna was computed using the regression
synthetic estimator (see Rao and Molina, 2015). In particular, the estimated SFCE
for the province of Enna is 25.29%, with an rmse of 1.98%. These results seem
plausible in light of the estimates obtained for the neighbouring provinces.


Fig. 1: Map of the FH estimates of the SFCE (left) and map of the iHappy variable
for 110 provinces in Italy (right).


In Italy the SFCE is 22.2% at national level, showing that on average food
consumption does not account for a large share of total consumption expenditure.
At provincial level (table 2) the SFCE varies between 15.44% (Ravenna, central
Italy) and 29.55% (Caserta, southern Italy), so there is evidence of spatial
heterogeneity. About a quarter of the provinces have an SFCE > 25%, and all of
them are in the southern part of Italy. Nine provinces have an estimated SFCE
below 18%: five are in the central part and the other four in the northern part of
Italy, confirming the well-known Italian north-south divide.

For a more detailed description of this application see Marchetti et al (2016).


5 Conclusions 


In this paper we focused on the iHappy indicator obtained from the analysis of
Twitter data. The data consist of all the geo-referenced tweets posted in 2012 in
the Italian provinces, classified by Curini et al (2015) and summarized at
provincial level as the percentage of happy tweets among those classified as happy
or unhappy.

In our analysis the iHappy indicator proved to be a good additional covariate for
predicting households’ SFCE, net of the influence of other covariates
characterizing the provinces, such as the tenure status of the house, the gender
of the head of the household and the level of local government expenditure in
support of vulnerable groups.

In Italy the SFCE shows a territorial variability that mimics that of many socio- 
economic indicators: in 2014 the north-eastern and north-western part of Italy had 
the lowest level of SFCE (respectively 15.7% and 15.5%) while the southern part 
(islands included) had the highest (21%). This north-south divide is evident also 
from the territorial distribution of the iHappy indicator, with few exceptions (some 
provinces of Sardinia, Puglia and Sicily). 

In conclusion, the iHappy indicator can provide a useful covariate on a yearly
basis, free of charge and broken down by province. It is affected by
self-selection bias and measurement error. In this application we assumed that the
self-selection is negligible, and the measurement error appears to be a minor
issue.
Gross Annual Salary of a new graduate: is it a
question of profile?


Gross Annual Salary of a new graduate: does it all
depend on the profile?


Paolo Mariani, Andrea Marletta and Mariangela Zenga 


Abstract The paper aims to identify an ideal profile for new graduates in the
recruitment process. Moreover, the distribution of their gross annual salary is
analyzed in relation to a selected profile. The analysis is based on the
Education-for-Labour Elicitation from Companies’ Attitudes towards University
Studies project, using the methodology of Conjoint Analysis. The data refer to 471
enterprises operating in Lombardy in different economic sectors, with particular
focus on the tertiary sector.
Abstract The paper aims to identify an ideal profile of new graduates in the
selection process. The distribution of the gross annual salary of the newly hired
graduate is also considered in relation to a specific selected profile. The
analysis is based on the research project Education-for-Labour Elicitation from
Companies’ Attitudes towards University Studies, using the methodology of Conjoint
Analysis. The results refer to 471 enterprises operating in Lombardy, across
different economic sectors, with particular attention to the tertiary sector.


Key words: New graduates, Conjoint Analysis, Utility Score, Gross Annual Salary. 
1 Introduction


This work concerns the understanding of the relationships between enterprises and
universities, with reference to the labour market for new graduates. In
particular, the study is based on the multi-centre research project
Education-for-Labour Elicitation from Companies’ Attitudes towards University
Studies (Electus) [3], involving 6 Italian universities. Electus has several aims.
First, it focuses on identifying an ideal graduate profile for several job
positions. Second, it seeks to identify across-the-board skills, universally
recognized as ’best practices’ for a graduate. Finally, the analysis makes it
possible to assess how wages vary with the competencies of new graduates.

The paper is organized as follows. Section 2 presents the statistical method
(Conjoint Analysis) and introduces the Mariani-Mussini coefficient of economic
valuation. The results of the Electus survey are shown in Section 3. Finally,
conclusions and main remarks are discussed in the last part of the paper.


2 Methodology: the Conjoint Analysis and the coefficient of 
Economic Variation 


Conjoint analysis (CA) is a technique widely used to investigate consumer choice
behaviour [4]. In this study, CA refers to the stated preference model used to
obtain part-worth utilities. The aim of this model is to estimate a utility
function (UF) for the characteristics describing several profiles. The UF is
defined as follows:

$$U_k = \sum_{s=0}^{n} \beta_s x_{sk} \tag{1}$$

where $x_{0k}$ is equal to 1, $n$ is the number of all attribute levels which
define the combination of a given profile, and $x_{sk}$ is the dummy variable that
refers to the specific attribute level $s$. As a result, the utility associated
with alternative $k$ ($U_k$) is obtained by summing the terms $\beta_s x_{sk}$
over all attribute levels, where $\beta_s$ is the partial change in $U_k$ for the
presence of attribute level $s$, holding all other variables constant. Usually,
when CA is performed, all respondents evaluate every possible profile. In this
experiment the profiles obtained by combining every level in a full factorial
fashion were too numerous, so it was necessary to apply an ad-hoc fractional
factorial design. According to several criteria [3], an individual random sample
of four profiles was administered to each respondent, who had to rate them on a
scale from 1 to 10. The final experimental design is both orthogonal and balanced.
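The utility function and the size of the full factorial design can be illustrated with a short sketch. The numbers of levels per attribute are those listed in the Application section; the part-worth values, the intercept and the function names are hypothetical, chosen only to show how a profile's utility is the intercept plus the part-worths of the levels it contains (the reference level of each attribute has part-worth 0 under dummy coding).

```python
from math import prod

# Number of levels per attribute, as listed in the Application section.
LEVELS = {
    "field_of_study": 10,
    "degree_mark": 3,
    "degree_level": 2,
    "english_knowledge": 2,
    "work_experience": 4,
    "willingness_to_travel": 3,
}


def full_factorial_size(levels):
    """Number of distinct profiles when every level is combined with every other."""
    return prod(levels.values())


def utility(intercept, part_worths, profile):
    """U_k = beta_0 + sum of the part-worths of the levels present in profile k."""
    return intercept + sum(part_worths[attr][lvl] for attr, lvl in profile.items())


# Hypothetical part-worths for two attributes only.
part_worths = {
    "degree_mark": {"low": 0.0, "medium": 0.5, "high": 0.75},
    "english_knowledge": {"inadequate": 0.0, "suitable": 1.25},
}
profile = {"degree_mark": "high", "english_knowledge": "suitable"}

print(full_factorial_size(LEVELS))          # profiles in the full factorial design
print(utility(5.0, part_worths, profile))   # utility of the hypothetical profile
```

With six attributes the full factorial design has far more profiles than a respondent could rate, which is why each respondent was shown only four of them.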
Part-worth utilities of the levels obtained from the non-standard CA represent the
starting point for re-evaluating the proposed Gross Annual Salary of the job
vacancies. The economic re-evaluation is then carried out through the relative
importance of the attributes in the non-standard CA, using the Mariani-Mussini
coefficient of economic valuation [5]. The general formulation of the coefficient
is:

$$MI_{ij} = \frac{U_i - U_b}{U_b} \cdot I_j$$

where $U_i$ is the sum of the part-worth utility scores associated with profile
$i$, $U_b$ is the sum of the utility scores associated with a baseline profile and
$I_j$ is the relative importance of attribute $j$.

The coefficient $MI_{ij}$ can also be used to estimate the variation in the salary
associated with profile $i$ compared with the baseline one. Given the salary $\pi$
associated with the baseline profile, the coefficient of economic re-evaluation
can be expressed as:

$$V_{ij} = MI_{ij} \cdot \pi \tag{2}$$

The variations $V_{ij}$ change in proportion to $I_j$, which entails two basic
considerations. First, when an attribute has a very high importance, $V_{ij}$
takes larger values. Second, $V_{ij}$ concerns attribute variations one at a time;
that is, profile comparisons are possible only by varying one attribute while
holding all others fixed. Moreover, if the baseline profile is the best (worst)
one, all coefficients $MI_{ij}$ and all variations $V_{ij}$ will be negative
(positive).
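A small numerical sketch may help to fix ideas. It assumes that the coefficient is the relative difference in total utility between profile and baseline multiplied by the relative importance of the attribute, and that the salary variation is the coefficient multiplied by the baseline salary. The function names, utilities, importance and salary below are hypothetical and are not taken from the Electus data.

```python
def mi_coefficient(u_profile, u_baseline, importance):
    """Relative utility difference from the baseline, scaled by attribute importance."""
    return (u_profile - u_baseline) / u_baseline * importance


def salary_variation(mi, baseline_salary):
    """Variation in salary implied by the coefficient, in currency units."""
    return mi * baseline_salary


# Hypothetical values: the baseline is the best profile, so the result is negative.
u_baseline = 10.0         # total utility of the baseline profile
u_profile = 8.0           # total utility of the profile under comparison
importance = 0.5          # relative importance of the attribute being varied
baseline_salary = 30000.0

mi = mi_coefficient(u_profile, u_baseline, importance)
variation = salary_variation(mi, baseline_salary)
print(f"MI = {100 * mi:.2f}%, salary variation = {variation:.0f}")
```

Since the variation is proportional to the importance of the attribute, halving the importance in this example would halve both the coefficient and the salary variation.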


3 Application 


The survey was conducted in 2015 using the CAWI technique. Data were collected
using a software program called Sawtooth [6]. Data manipulation and Conjoint
Analysis were carried out using R and the Conjoint package [1].

The questionnaire contained two sections: conjoint experiment for the five job 
positions and general information about the company (demographic questions). Re- 
garding the five job positions for the new graduates, Administration clerk, HR as- 
sistant, Marketing assistant, ICT professional and CRM assistant were considered. 
To specify the candidates’ profile, six attributes were used: 


• Field of Study with 10 levels (Philosophy and literature, Educational sciences,
Political science/Sociology, Economics, Law, Statistics, Industrial engineering,
Mathematics/Computer sciences, Psychology, Foreign languages),

• Degree Mark with 3 levels (Low, Medium, High),

• Degree Level with 2 levels (Bachelor, Master),

• English Knowledge with 2 levels (Suitable for communication with foreigners,
Inadequate for communication with foreigners),

• Relevant work experience with 4 levels (No experience at all, Internship during or
after completion of university studies, Discontinuous or occasional work during
university studies, One year or more of regular work),

• Willingness to Travel on Business with 3 levels (Unwilling to travel on business,
Willing to travel on business only for short periods, Willing to travel on business
even for long periods).


After rating the selected profiles and choosing the best one, the entrepreneurs
had to propose a Gross Annual Salary for the chosen profile, in order to measure
the so-called ’willingness to pay’ [2].

As far as the Milano-Bicocca research unit is concerned, interviewees were
representatives of companies registered on the Almalaurea portal for recruitment
and linkage, limited to the university site. There were 471 final respondents. The
company profile shows that most (52%) were small (15-49 employees), followed by
medium-sized companies with 50 to 249 employees (25.6%) and large companies with
250 employees or more (22.4%). The most represented sectors of activity were
services to industry (62.1%), services to persons or families (16.2%) and
manufacturing (14.9%). The majority of companies (89.4%) operated, fully or
partially, within the domestic market. Moreover, they were mainly managed by the
entrepreneur (63%).

Five CAs were carried out, one for each job position, in order to measure
entrepreneurs’ preferences. Results for the part-worth utilities are similar
across positions for all attributes except Field of Study. This means that the
other competencies have levels that are universally identified as ’best practice’
for a graduate.

The attributes Relevant work experience and English Knowledge always show the same
best level for every vacancy. Indeed, it is easy to imagine that companies prefer
to employ a candidate with one year or more of regular work and with English
suitable for communication with foreigners. For Degree Mark and Willingness to
travel on business, the two best levels are always preferred, so a medium or high
degree mark and willingness to travel on business for short or long periods are
preferable among candidates.

Utility scores for Degree Level are very close to 0 for each position, which means
that there is no substantial difference between a bachelor and a master degree for
a graduate.

The attribute Field of Study is more complex to analyse since it is less
transversal: a degree in a given field could be the best for one position and the
worst for another. For this reason, in this paper only variations in Field of
Study are taken into account. This allows a comparison of the coefficient of
economic re-evaluation $MI_{ij}$ and its associated variation $V_{ij}$ across job
positions.

In Fig. 1 the part-worth utilities for the Field of Study attribute from the 5
non-standard CAs are shown for each job position. Economics represents the best
profile for 3 job positions, while degrees in Psychology and in
Mathematics/Computer sciences maximize utility for HR assistant and ICT
professional, respectively.

Fig. 1 Part-worth utilities for job position and field of study


The contribution of these part-worth utilities to the total utility $U_i$ is very
relevant, since Field of Study has the highest relative importance $I_j$ among the
attributes for every job vacancy. Its relative importance $I_j$ ranges from a
minimum of 42.98% for the position in customer relationship management (CRM) to a
maximum of 80.54% for the ICT professional position. The importance of the other
attributes is always under 20% for each position.

In this application, the baseline profile is the best profile, i.e. the one that
maximizes the total utility, so $U$ is the sum of the highest part-worth utilities
for each attribute $j$ (plus an intercept). This means that all $MI_{ij}$ coefficients
and all variations $V_{ij}$ are negative or, for the best level, zero.

Table 1 shows the $MI_{ij}$ coefficients of economic re-evaluation for Field of Study.
As expected, each $MI_{ij} \le 0$, with $MI_{ij} = 0$ only for the best Field of Study
of each vacancy. Comparing the job positions, ICT professionals display the largest
coefficients in absolute value. This is because $I_j$ is very high and a degree in
Mathematics/Computer sciences is fully specialized for this position. The largest
coefficient in absolute value is that of a degree in Foreign languages for ICT
professionals: compared with graduates in Mathematics/Computer sciences, these
graduates earn a halved Gross Annual Salary.

Table 1 $MI_{ij}$ coefficients for Field of Study

Field of Study \ Job position      AC          HR          ICT        MKT        CRM
Philosophy and literature        -18.27%     -10.97%     -45.96%    -12.35%    -10.84%
Educational sciences             -16.68%      -5.23%     -38.39%    -13.24%     -8.41%
Political science/Sociology      -10.63%     -10.76%     -47.12%    -11.03%     -5.61%
Economics                          ---     [illegible]   -33.70%      ---        ---
Law                              -12.17%      -7.83%     -48.68%    -15.71%     -7.60%
Statistics                        -9.63%     -17.79%     -32.48%    -11.41%     -8.13%
Industrial engineering           -16.37%     -24.60%     -26.29%    -14.70%     -6.65%
Mathematics/Computer sciences     -9.68%     -21.21%       ---      -14.82%     -6.80%
Psychology                       -19.86%       ---       -50.40%    -10.47%     -8.04%
Foreign languages                -13.40%     -14.13%     -51.39%     -9.24%     -7.67%


4 Conclusion and future research 


The article proposes the use of a non-standard Conjoint Analysis to detect the best
profiles for graduates using data from the Electus project. Moreover, a new
evaluation of the Gross Annual Salary is proposed using the Mariani-Mussini index
derived from utility scores. The analyses for 5 different job positions show that all
graduates' competencies are cross-cutting, except for Field of Study. English
knowledge, a medium or high degree mark, relevant work experience and willingness
to travel are the most important attributes required of a graduate. As for Field of
Study, a degree in Economics seems to be the most attractive for entrepreneurs,
except for very specialized vacancies such as ICT professionals, where graduates of
other degree courses earn a halved salary compared with Computer Sciences graduates.
Future research will focus on results from a stratified CA based on the
socio-demographic features of the companies responding in the Electus project.
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Dynamic random coefficient based drop-out 
models for longitudinal responses 


Dynamic random coefficient models for longitudinal
responses affected by non-ignorable drop-out


Maria Francesca Marino and Marco Alfò


Abstract We propose a dynamic random coefficient based drop-out model for the
analysis of longitudinal data subject to potentially non-ignorable drop-out. The
presence of non-ignorable missingness may severely bias inference on the observed
data. In this framework, random coefficient based drop-out models represent a
flexible approach to jointly model both longitudinal responses and missingness. We
extend such an approach by allowing the random parameters in the longitudinal data
process to evolve over time according to a non-homogeneous hidden Markov chain.
The resulting model offers great flexibility and allows us to efficiently describe both
between-outcome and within-outcome dependence.

Abstract Longitudinal studies are often characterised by missing data due to some
individuals leaving the study prematurely. When the mechanism leading to the
missing data is non-ignorable, valid inferential conclusions can be reached only by
jointly modelling two outcomes: the longitudinal process and the process generating
the missing data itself. To this end, we propose a regression model for longitudinal
data subject to potentially non-ignorable drop-out in which time-constant and
time-varying random coefficients are jointly taken into account. This makes it
possible to suitably model both the dependence between repeated measurements of
the same outcome on the same statistical unit and the dependence between different
outcomes.


Key words: Hidden Markov models, nonparametric maximum likelihood, random
effects, missingness
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1 Introduction 


Longitudinal studies are frequently affected by drop-out. If, once we condition on
the observed data, the selection of individuals staying in the study still depends on
(future) unobserved responses, the missingness mechanism is said to be non-ignorable
[9]. In this case, to obtain valid inference, missingness should be explicitly taken
into account.

Different alternatives are available in the literature to deal with non-ignorable
drop-out [8]. Among them, random coefficient based drop-out models [RCBDMs, 7]
represent an interesting approach. They allow for two different sets of
individual-specific random parameters, one for the longitudinal and one for the
missing data process. These capture the dependence between repeated measurements
from the same individual (within-individual dependence). The corresponding joint
distribution, in turn, provides a measure of dependence between the longitudinal
and the missingness process (between-outcome dependence).

When dealing with longitudinal data, the assumption of time-constant, individual- 
specific, sources of unobserved heterogeneity may be too restrictive [1]. Starting 
from the proposal by [10], we introduce a dynamic random coefficient based drop- 
out model, where time-varying random parameters are considered to model the lon- 
gitudinal outcome. To explain our proposal, we assume that the dependence between 
the longitudinal and the missing data process is captured by an individual-specific 
upper-level mixture. Also, to describe the dependence within profiles, we consider
two further sets of random parameters. For the longitudinal outcome, we use
individual-specific, time-varying, random parameters that evolve over time according
to a non-homogeneous hidden Markov chain. For the missing data outcome, on the
other hand, we consider individual-specific, time-constant, random parameters
identifying non-homogeneous propensities to stay in the study.

The proposed model is applied to the Leiden 85+ dataset, where the effect of
demographic and genetic factors on the evolution of cognitive functioning in elderly
people is of main interest [3]. Due to poor health conditions or death, individuals
enrolled in the study may present incomplete sequences. We show how the pro- 
posed model specification may be fruitfully exploited to derive valid inference on 
the parameters of interest. 


2 Motivating example: the Leiden 85+ study 


The Leiden 85+ study is a longitudinal study conducted by the Leiden University
Medical Centre in the Netherlands, with the aim of analysing the evolution of
cognitive functioning in the elderly. The study involves Leiden inhabitants who turned
85 years old between September 1997 and September 1999. The sample consists of
541 elderly people who were followed for six consecutive yearly visits until they
reached 90 years of age. Patient conditions were assessed via the Mini-Mental State
Examination [MMSE, 6] index, which takes values between 0 and 30, with higher
values corresponding to better cognitive skills. The aim of the study is to identify
demographic and genetic factors influencing the dynamics of cognitive functioning
and healthy aging. To this purpose, the following covariates were measured: age,
gender, educational status, and APOE genotype. The latter identifies the
Apolipoprotein E genotype of the patient; in particular, the ε4 allele is known to be
linked to the risk of dementia. Due to the design of the study, a number of
participants present incomplete responses (i.e. drop-out) because of poor health
conditions or death.

Preliminary analyses show that MMSE values generally decrease with time, but
this trend is more evident for subjects dropping out prematurely. This finding raises
the question of whether the process leading to missing data may be ignored. In the
next section, we introduce a dynamic RCBDM to account for both the potential
dependence between the longitudinal data process and the drop-out mechanism and
the within-profile dependence.


3 The dynamic RCBDM 


Let us suppose a longitudinal study is designed to collect measures of a response
variable $Y_{it}$, $i = 1,\dots,n$, $t = 1,\dots,T$, on a sample of $n$ individuals at
$T$ time occasions, and let $Y_i = (Y_{i1},\dots,Y_{iT})$ denote the individual response
sequence. As is frequent in longitudinal studies, some individuals in the sample may
drop out prematurely and, thus, may present incomplete sequences. In this framework,
let $R_i = (R_{i1},\dots,R_{iT_i^*})$ indicate the $T_i^*$-dimensional missing data
vector, where $T_i^* = \min(T_i + 1, T)$ and $T_i$ denotes the number of available
measurements for the $i$-th individual. $R_{it}$ is defined as a binary variable with
$R_{it} = 0$ if the $i$-th individual drops out from the study between occasions
$t - 1$ and $t$, and $R_{it} = 1$ otherwise.
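The definition of the missing data vector can be illustrated with a short sketch. The function name and arguments below are hypothetical, and the sketch assumes that every individual has at least one available measurement ($T_i \ge 1$), which is not stated explicitly in the text.

```python
def missingness_indicator(n_observed, n_occasions):
    """Return R_i for a unit with T_i = n_observed measurements out of T = n_occasions.

    Assumption (not stated in the text): 1 <= T_i <= T.
    """
    if not 1 <= n_observed <= n_occasions:
        raise ValueError("need 1 <= T_i <= T")
    length = min(n_observed + 1, n_occasions)  # T_i^* = min(T_i + 1, T)
    # R_it = 1 while the unit is still in the study, 0 at the drop-out occasion.
    return [1 if t <= n_observed else 0 for t in range(1, length + 1)]


r_dropout = missingness_indicator(3, 6)   # unit observed at the first three occasions
r_complete = missingness_indicator(6, 6)  # unit observed at all six occasions
```

A unit that drops out contributes one extra entry (the zero marking the drop-out occasion), while a completer contributes exactly $T$ ones.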

Furthermore, let $Z_{it} \in \{1,\dots,G\}$ and $U_i \in \{1,\dots,K\}$ be two
individual-specific, discrete, latent variables influencing the longitudinal and the
missing data process, respectively. While the latter variable is assumed to depend on
the individual $i$ only, the former is both individual- and time-specific. This allows
us to capture sources of unobserved dynamics that influence $Y_{it}$ and that would
hardly be captured by a time-constant latent variable.

We assume that the longitudinal outcome $Y_{it}$ only depends on the corresponding
latent variable $Z_{it}$ and that, conditional on the vector $Z_i = (Z_{i1},\dots,Z_{iT})$,
the elements of $Y_i$ are independent, with joint (conditional) density given by

\[ f(y_i \mid Z_i = z_i) = \prod_{t=1}^{T} f(y_{it} \mid Z_{it} = z_{it}). \]

Similarly, we assume that, conditional on the latent variable $U_i$, the missingness
indicators are independent and the corresponding joint (conditional) density is

\[ f(r_i \mid U_i = u_i) = \prod_{t=1}^{T_i^*} f(r_{it} \mid U_i = u_i). \]


To describe the effect of the observed covariates on the outcomes $(Y_{it}, R_{it})$,
the following regression models are also defined:

\[ g[E(Y_{it} \mid Z_{it} = g)] = \zeta_g + x_{it}'\beta, \]
\[ \mathrm{logit}[\Pr(R_{it} = 0 \mid U_i = k)] = \xi_k + w_{it}'\gamma. \]

In the expressions above, $g(\cdot)$ represents an appropriate link function, while
the parameters $\beta$ and $\gamma$ describe the effects of the observed covariates,
$x_{it}$ and $w_{it}$, on $Y_{it}$ and $R_{it}$, respectively. Also, $\zeta_g$,
$g = 1,\dots,G$, denotes the value of the random intercept in the longitudinal data
model when $Z_{it} = g$. To simplify the interpretation of such parameters, we
introduce the following ordinal constraint:

\[ \zeta_1 \le \dots \le \zeta_G, \tag{1} \]

so that lower values of $Z_{it}$ correspond to lower values for the longitudinal
response. Last, $\xi_k$, $k = 1,\dots,K$, denotes the discrete random intercept
associated with the missing data process when $U_i = k$.

Following an approach similar to that suggested by [10], we model the dependence
between $Z_i$ and $U_i$ and, therefore, between the longitudinal and the missing
data process, by considering a discrete upper-level latent variable, $V_i$, defined on
the support set $\{1,\dots,H\}$, with $\tau_h = \Pr(V_i = h)$, $h = 1,\dots,H$. In
particular, we assume that, conditional on $V_i = h$, the latent variables $Z_i$ and
$U_i$ are independent, with joint distribution described by the following
(association) model:

\[ f(z_i, u_i) = \sum_{h=1}^{H} \tau_h \left[ \Pr(Z_i = z_i \mid V_i = h) \Pr(U_i = u_i \mid V_i = h) \right]. \]
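To see how the upper-level mixture induces dependence between the two latent variables, the following sketch evaluates the association model for a single time occasion. All numerical values are hypothetical, and restricting $Z_i$ to its first element is a simplification made only to keep the example small.

```python
# Hypothetical values: H = 2 upper-level components, G = 2 states, K = 2 classes.
tau = [0.6, 0.4]              # tau_h = Pr(V_i = h)
p_z_given_v = [[0.8, 0.2],    # Pr(Z_i1 = g | V_i = h), one row per h
               [0.3, 0.7]]
p_u_given_v = [[0.9, 0.1],    # Pr(U_i = k | V_i = h), one row per h
               [0.2, 0.8]]


def joint_pmf(tau, p_z, p_u):
    """f(z, u) = sum_h tau_h Pr(Z = z | V = h) Pr(U = u | V = h)."""
    n_states, n_classes = len(p_z[0]), len(p_u[0])
    return [[sum(t * pz[g] * pu[k] for t, pz, pu in zip(tau, p_z, p_u))
             for k in range(n_classes)] for g in range(n_states)]


joint = joint_pmf(tau, p_z_given_v, p_u_given_v)
marg_z = [sum(row) for row in joint]
marg_u = [sum(col) for col in zip(*joint)]
# Conditional independence given V does not imply marginal independence:
# the gap below is non-zero unless the mixture is degenerate.
gap = joint[0][0] - marg_z[0] * marg_u[0]
```

Here `gap` measures how far the joint probability of the first state and the first class is from the product of the marginals; a non-zero value reflects the between-outcome dependence captured by $V_i$.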


With the aim of accounting for time-varying sources of unobserved heterogeneity
influencing the longitudinal data process, we assume that, conditional on the $h$-th
component of the upper-level mixture, that is conditional on $V_i = h$, the latent
variables $Z_{it}$ evolve over time according to a first order hidden Markov chain
with initial probability vector $\delta_h$ and transition probability matrix $Q_h$,
with $h = 1,\dots,H$.


3.1 Reducing model complexity 


The adopted parameterization is quite complex, which could lead to numerical
difficulties when deriving the corresponding maximum likelihood estimates. In order
to reduce the number of parameters, we follow an approach similar to that by [4] and
specify $\delta_h$ and $Q_h$ via a global logit parameterization. This choice is
motivated by the constraints specified in equation (1) which, in turn, lead us to treat
the latent variable $Z_{it}$ as ordinal. In this framework, the initial probabilities of
the hidden Markov chain are defined according to the model

\[ \log \frac{\delta_{g|h} + \dots + \delta_{G|h}}{\delta_{1|h} + \dots + \delta_{g-1|h}} = \alpha_{0g} + \psi_{0h}, \tag{2} \]

with $h = 1,\dots,H$ and $g = 2,\dots,G$. For identifiability purposes, we set
$\psi_{01} = 0$, so that the number of parameters to be estimated reduces to
$(G - 1) + (H - 1)$. On the other hand, transition probabilities are modelled
according to the following ordinal logit:

\[ \log \frac{q_{g'g|h} + \dots + q_{Gg|h}}{q_{1g|h} + \dots + q_{g'-1\,g|h}} = \alpha_{gg'} + \psi_{1h}, \tag{3} \]

with $h = 1,\dots,H$, $g = 1,\dots,G$, and $g' = 2,\dots,G$, where $q_{g'g|h}$ denotes
the probability of moving from state $g$ to state $g'$ within component $h$. As
above, to ensure parameter identifiability, we set $\psi_{11} = 0$, so that
$G(G - 1) + (H - 1)$ parameters need to be estimated.
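A minimal sketch of how a global logit parameterisation of the kind in equation (2) maps back to initial probabilities is given below. It assumes that the log-odds of being in state $g$ or above, versus below $g$, equal a state-specific intercept plus a component-specific shift; the function names, the argument names and the numerical values are hypothetical.

```python
import math


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def initial_probs(intercepts, shift):
    """Initial probabilities for states 1,...,G from global logits.

    intercepts: hypothetical intercepts for g = 2,...,G; they must be
                decreasing so that Pr(Z >= g) decreases in g.
    shift:      hypothetical component-specific shift (0 for the reference h).
    """
    # Survivor probabilities Pr(Z >= g) for g = 1,...,G, padded with Pr(Z >= G+1) = 0.
    surv = [1.0] + [expit(a + shift) for a in intercepts] + [0.0]
    if any(s1 < s2 for s1, s2 in zip(surv, surv[1:])):
        raise ValueError("intercepts must be decreasing in g")
    return [s1 - s2 for s1, s2 in zip(surv, surv[1:])]


delta_ref = initial_probs([0.5, -1.0], 0.0)    # reference component, shift fixed at 0
delta_other = initial_probs([0.5, -1.0], 1.0)  # another component, hypothetical shift
```

Because the shift enters every global logit of a component in the same way, a positive shift moves probability mass towards higher states, which is what keeps the parameter count at $(G - 1) + (H - 1)$.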


4 Model inference 


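Let $\theta$ denote the vector of all model parameters. Estimation of such parameters
can be carried out via a maximum likelihood approach. Due to the local independence
assumption within and between the longitudinal and the missing data responses,
$Y_i$ and $R_i$, inference may be based on the following observed data likelihood:

\[ L(\theta) = \prod_{i=1}^{n} \sum_{h=1}^{H} \tau_h \left\{ \sum_{z_{i1},\dots,z_{iT_i}} \left[ \prod_{t=1}^{T_i} f(y_{it} \mid Z_{it} = z_{it}) \, \delta_{z_{i1}|h} \prod_{t=2}^{T_i} q_{z_{it} z_{it-1}|h} \right] \times \sum_{u_i} \left[ \prod_{t=1}^{T_i^*} f(r_{it} \mid U_i = u_i) \, \pi_{u_i|h} \right] \right\}, \]

where $\pi_{u_i|h} = \Pr(U_i = u_i \mid V_i = h)$.

To avoid multiple summations over all possible realisations of the hidden chain,
$z_{i1},\dots,z_{iT_i}$, we may rely on the EM algorithm [5].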

In this framework, two separate steps need to be alternated. In the E-step, we
compute the expected value of the complete data log-likelihood, conditional on the
observed data and the current parameter estimates. This computation can be
considerably simplified by extending the standard forward-backward algorithm [2],
which is typically used in the hidden Markov model framework. In the M-step, we
maximize the expected value of the complete data log-likelihood with respect to the
model parameters. The E- and the M-steps are iterated until convergence. As is
frequent when dealing with discrete latent variables, to avoid local maxima or
spurious solutions, we may consider a multi-start strategy based on both deterministic
and random starting solutions. Also, the number of upper- and lower-level
components/states is treated as fixed and known. The algorithm is run for varying
choices of $(G, K, H)$ and the best model is chosen via standard model selection
techniques.
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Hidden Markov models: dimensionality 
reduction, atypical observations and algorithms 


Hidden Markov models: dimensionality reduction, anomalous
data and algorithms


Antonello Maruotti and Jan Bulla 


Abstract We develop a new class of parsimonious models to perform time-varying
clustering and dimensionality reduction in a time-series setting, while accounting
for atypical observations. The problem of similarity search in time-series data is
addressed by specifying a hidden Markov model. For the maximum likelihood
estimation of the model parameters, we outline an ad-hoc Alternating
Expectation-Conditional Maximization (AECM) algorithm. Since the inclusion of
covariates in the model might alter the formation of the latent states, and parameter
estimation could become infeasible with large numbers of time points and covariates,
we first construct the observed model, then obtain the latent state classifications, and
subsequently study the relationship between covariates and latent state memberships.

Abstract In this work, we introduce a new class of parsimonious models to classify
and reduce the dimension of the variable space in a time-series setting, taking
anomalous data into account in the estimation process. A hidden Markov model is
specified to classify the time series. For the maximum likelihood estimation of the
model parameters, we define an ad hoc AECM-type algorithm. Including covariates in
the model could alter the formation of the latent states, and parameter estimation
could become computationally demanding. We have therefore built a model for the
observed part and, subsequently, obtained the classification of the observations into
the latent states; finally, we study the relationship between covariates and latent
states.


Key words: Factor model, AECM, Three-step algorithm, Contaminated Gaussian 
distribution 
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1 Introduction 


Hidden Markov models (HMMs) are state of the art in the analysis of time-dependent
data. These models have been applied in time series analysis for more than four
decades [1]. Being dependent mixture models, HMMs make it possible to recover the
structure of the data unambiguously by rigorously defining homogeneous latent
subgroups; at the same time, they provide a meaningful interpretation of the inferred
partition. Nowadays, Gaussian HMMs are commonly used for clustering continuous
data (see, e.g., [2]), although some robust (conditional) distributions have been
recently proposed in the literature [4, 8].

Real data are often contaminated by outliers, spurious points or noise (collectively
called atypical observations herein) that may affect both the parameter estimates
and the ability to recover the latent structure. Despite the wide literature on robust
estimation of mixture models, few papers deal with robustness issues in HMMs [3, 5].

The challenge of modeling multivariate time-series and their interactions is common
to all analyses of high dimensional data with many variables of interest.
Dimensionality-related aspects present a challenge because these time-series could
be highly correlated. Therefore, estimation and interpretation of the parameters of
interest may become non-trivial. In order to examine the interrelationships between
time-series and, at the same time, to perform dimensionality reduction in the variable
space (allowing for an easy interpretation of model parameters), we propose the use
of a latent factor model. Accordingly, we define a general class of parsimonious
HMMs by imposing a factor decomposition on the state-specific covariance matrices.
The loadings and noise terms of the covariance matrix may be constrained to be equal
or unequal across latent states. In addition, the noise term may be subject to further
restrictions, resulting in a set of eight parsimonious covariance structures [6, 7].

Finally, in order to characterize transitions between hidden states and to estimate
the effects of observed covariates on the transitions, we use a multinomial logistic
regression model, which is capable of revealing the heterogeneity in the transition
process.


2 Methodology 


2.1 Notation and assumptions 


Let $\{Y_t, t = 1,\dots,T\}$ denote a sequence of multivariate longitudinal
observations recorded at $T$ times, where $Y_t = (Y_{t1},\dots,Y_{tP}) \in \mathbb{R}^P$,
and let $\{S_t, t = 1,\dots,T\}$ be a first-order Markov chain defined on the state
space $\{1,\dots,k,\dots,K\}$. A HMM is a particular kind of dependent mixture. It is
a stochastic process consisting of two parts: the underlying unobserved process
$\{S_t\}$, fulfilling the Markov property, i.e.

\[ \Pr(S_t = s_t \mid S_1 = s_1, S_2 = s_2, \dots, S_{t-1} = s_{t-1}) = \Pr(S_t = s_t \mid S_{t-1} = s_{t-1}), \]

and the state-dependent observation process $\{Y_t\}$ for which the conditional
independence property holds, i.e.

\[ f(Y_t = y_t \mid Y_1 = y_1, \dots, Y_{t-1} = y_{t-1}, S_1 = s_1, \dots, S_t = s_t) = f(Y_t = y_t \mid S_t = s_t), \]

where $f(\cdot)$ is a generic probability density function.
The hidden Markov chain has $K$ states with initial probabilities

\[ \pi_k = \Pr(S_1 = k), \quad k = 1,\dots,K, \]

and transition probabilities

\[ {}_t\pi_{k|j} = \Pr(S_t = k \mid S_{t-1} = j), \quad t = 2,\dots,T, \quad j,k = 1,\dots,K. \tag{1} \]

In (1), $k$ refers to the current state, whereas $j$ refers to the one previously
visited; this convention will be used throughout the paper. Assuming that the hidden
process follows a first-order Markov chain is equivalent to assuming that any latent
variable $S_t$, given $S_{t-1}$, is conditionally independent of
$S_1, S_2, \dots, S_{t-2}$. This dependence structure is seldom considered restrictive
and, due to its easy interpretation, is usually preferred to more complex structures
of the latent variables.

With the initial probabilities $\pi_k$ and the transition probabilities
${}_t\pi_{k|j}$ defined as above, the initial probabilities are collected in the
$K$-dimensional vector $\pi$, whereas the $K \times K$ transition probability matrix
${}_t\Pi$ contains the time-varying transition probabilities. The simplest model in
this framework is the homogeneous HMM, which assumes time-homogeneous transition
probabilities, i.e. independence of $t$ and thus ${}_t\Pi = \Pi$. This specification
fails to take into account how observed conditions affect the evolution of the
unobserved states and, in general, the time heterogeneity of the transition
probability matrix. In order to overcome this potential drawback, the transition
probabilities may be parametrized as a function of $P$ exogenous covariates
$x_t = (x_{t1},\dots,x_{tP})$ by


\[ {}_t\pi_{k|j} = \frac{\exp(x_t'\gamma_{jk} + \gamma_{jk0})}{1 + \sum_{h=1, h \ne j}^{K} \exp(x_t'\gamma_{jh} + \gamma_{jh0})}, \tag{2} \]

where $\gamma_{jk} = (\gamma_{jk1},\dots,\gamma_{jkP})'$ represents a vector of fixed
regression coefficients and $\gamma_{jk0}$ is an intercept term. To ensure
identifiability, we impose $\gamma_{jj} = 0$ and $\gamma_{jj0} = 0$ for
$j = 1,\dots,K$. Accordingly, the probability of no transition at time $t$ is given by

\[ {}_t\pi_{j|j} = \frac{1}{1 + \sum_{h=1, h \ne j}^{K} \exp(x_t'\gamma_{jh} + \gamma_{jh0})}. \]

This model specification makes it possible to investigate the dynamics of the hidden
state sequence over time, and allows for a potential impact of covariates on its
evolution.
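The covariate-dependent transition probabilities can be sketched as follows: one row of the transition matrix is computed from a multinomial logit in which remaining in the current state is the reference category. The function name, the data layout and all numerical values are hypothetical; states are indexed from 0 in the code.

```python
import math


def transition_row(j, x_t, coef, intercept):
    """Row j of the time-t transition matrix under a multinomial logit.

    coef[j][k] is the coefficient vector and intercept[j][k] the intercept
    for the move j -> k. Staying in j is the reference category, so its
    coefficients are fixed at zero for identifiability.
    """
    n_states = len(intercept[j])
    scores = []
    for k in range(n_states):
        if k == j:
            scores.append(0.0)  # reference category: linear predictor is 0
        else:
            lin = sum(c * x for c, x in zip(coef[j][k], x_t))
            scores.append(intercept[j][k] + lin)
    denom = sum(math.exp(s) for s in scores)  # equals 1 + sum over k != j
    return [math.exp(s) / denom for s in scores]


# Hypothetical example with K = 3 states and two covariates.
K = 3
coef = [[[0.0, 0.0] for _ in range(K)] for _ in range(K)]
intercept = [[0.0] * K for _ in range(K)]
intercept[0][1] = math.log(2.0)  # moving 0 -> 1 is twice as likely as staying
row0 = transition_row(0, [1.5, -2.0], coef, intercept)
```

With all coefficients equal to zero the row is uniform; here the single non-zero intercept makes the move to the second state twice as likely as remaining in the first.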


2.2 The contaminated factor HMM 


As concerns $Y_t \mid S_t = k$, in the fashion of [8], we assume a contaminated
Gaussian distribution

\[ f_{CN}(Y_t \mid S_t = k; \mu_k, \Sigma_k, \alpha_k, \eta_k) = \alpha_k \phi(Y_t \mid S_t = k; \mu_k, \Sigma_k) + (1 - \alpha_k) \phi(Y_t \mid S_t = k; \mu_k, \eta_k \Sigma_k), \tag{3} \]

where $\phi(\cdot; \mu_k, \Sigma_k)$ denotes the density of a $P$-variate Gaussian
distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$, $\alpha_k \in (0, 1)$
is the proportion of good points in state $k$, and $\eta_k > 1$ is an inflation
parameter denoting the degree of contamination of the bad points in state $k$.
In symbols, $Y_t \mid S_t = k \sim CN_P(\mu_k, \Sigma_k, \alpha_k, \eta_k)$.
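As an illustration of the contaminated Gaussian density, the sketch below implements the univariate special case ($P = 1$), chosen only to keep the example free of matrix algebra. It also evaluates, by Bayes' rule, the probability that a given value is a good point. Function names and parameter values are hypothetical.

```python
import math


def normal_pdf(y, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def contaminated_normal_pdf(y, mu, var, alpha, eta):
    """alpha * N(mu, var) + (1 - alpha) * N(mu, eta * var), with eta > 1."""
    if not (0.0 < alpha < 1.0 and eta > 1.0):
        raise ValueError("need 0 < alpha < 1 and eta > 1")
    good = alpha * normal_pdf(y, mu, var)
    bad = (1.0 - alpha) * normal_pdf(y, mu, eta * var)
    return good + bad


def posterior_good(y, mu, var, alpha, eta):
    """Probability that y is a good point, obtained by Bayes' rule."""
    good = alpha * normal_pdf(y, mu, var)
    return good / contaminated_normal_pdf(y, mu, var, alpha, eta)


# Hypothetical state parameters: 90% good points, variance inflated tenfold.
p_central = posterior_good(0.0, 0.0, 1.0, 0.9, 10.0)
p_extreme = posterior_good(5.0, 0.0, 1.0, 0.9, 10.0)
```

A value near the state mean is almost certainly a good point, whereas a value far in the tail is attributed to the inflated component, which is how the model flags atypical observations instead of letting them distort the estimates.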

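In order to allow for dimension reduction and parsimony, we further assume a
contaminated Gaussian factor analyzers model for $Y_t \mid S_t = k$. Such a model
postulates

\[ Y_t \mid (S_t = k) = \mu_k + \Lambda_k U_{tk} + e_{tk}, \tag{4} \]

where $U_{tk}$ is a $Q$-dimensional ($Q < P$) vector of latent factors, $\Lambda_k$ is
a $P \times Q$ matrix of factor loadings, and $e_{tk}$ is the error term. The
contaminated Gaussian factor analyzers model generalizes the corresponding Gaussian
model by assuming

\[ \begin{pmatrix} Y_t \mid S_t = k \\ U_{tk} \end{pmatrix} \sim CN_{P+Q}(\mu_k^*, \Sigma_k^*, \alpha_k, \eta_k), \tag{5} \]

where

\[ \mu_k^* = \begin{pmatrix} \mu_k \\ 0_Q \end{pmatrix} \quad \text{and} \quad \Sigma_k^* = \begin{pmatrix} \Lambda_k \Lambda_k' + \Psi_k & \Lambda_k \\ \Lambda_k' & I_Q \end{pmatrix}. \tag{6} \]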

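To further analyze the model it is useful to introduce the dichotomous variable
$V_{tk}$, which takes value 1, with probability $\alpha_k$, if the observation at time
$t$ in state $k$ is a good point, and value 0, with probability $1 - \alpha_k$, if it is
a bad point. Thus, for good and bad points we have

\[ \begin{pmatrix} Y_t \mid S_t = k \\ U_{tk} \end{pmatrix} \Big| \, V_{tk} = 1 \sim N_{P+Q}(\mu_k^*, \Sigma_k^*) \quad \text{and} \quad \begin{pmatrix} Y_t \mid S_t = k \\ U_{tk} \end{pmatrix} \Big| \, V_{tk} = 0 \sim N_{P+Q}(\mu_k^*, \eta_k \Sigma_k^*), \]

respectively, where $N_B(\mu, \Sigma)$ denotes a $B$-variate Gaussian distribution
with mean vector $\mu$ and covariance matrix $\Sigma$. Thus,

\[ Y_t \mid S_t = k, V_{tk} = 1 \sim N_P(\mu_k, \Lambda_k \Lambda_k' + \Psi_k), \]
\[ Y_t \mid S_t = k, V_{tk} = 0 \sim N_P(\mu_k, \eta_k (\Lambda_k \Lambda_k' + \Psi_k)), \]
\[ U_{tk} \mid V_{tk} = 1 \sim N_Q(0_Q, I_Q), \]
\[ U_{tk} \mid V_{tk} = 0 \sim N_Q(0_Q, \eta_k I_Q), \]
\[ e_{tk} \mid V_{tk} = 1 \sim N_P(0_P, \Psi_k), \]
\[ e_{tk} \mid V_{tk} = 0 \sim N_P(0_P, \eta_k \Psi_k), \]

so that

\[ Y_t \mid S_t = k \sim CN_P(\mu_k, \Lambda_k \Lambda_k' + \Psi_k, \alpha_k, \eta_k), \]
\[ U_{tk} \sim CN_Q(0_Q, I_Q, \alpha_k, \eta_k), \]
\[ e_{tk} \sim CN_P(0_P, \Psi_k, \alpha_k, \eta_k), \]

where $\Psi_k = \mathrm{diag}(\psi_{k1}, \dots, \psi_{kP})$.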


2.3 Parsimonious HMCNFA family 


In this section, in the fashion of [6], we extend the hidden Markov model of
contaminated Gaussian factor analyzers (HMCNFA) by allowing additional constraints
across states on the $\Lambda_k$ and $\Psi_k$ matrices and on whether or not
$\Psi_k = \psi_k I_P$ (isotropic constraint). The full range of possible constraints
provides a family of eight different parsimonious HMCNFA models, which are given
in Table 1.


Table 1 Parsimonious structures derived from the hidden Markov of contaminated Gaussian factor 
analyzers model. 


Identifier  $\Lambda_1,\dots,\Lambda_K$  $\Psi_1,\dots,\Psi_K$  Isotropy of $\Psi_k$  $\Sigma_k$                            # of free parameters for $\Sigma_1,\dots,\Sigma_K$
UUU         Unconstrained                Unconstrained          Unconstrained         $\Lambda_k\Lambda_k' + \Psi_k$        $K[PQ - Q(Q-1)/2] + KP$
UUC         Unconstrained                Unconstrained          Constrained           $\Lambda_k\Lambda_k' + \psi_k I_P$    $K[PQ - Q(Q-1)/2] + K$
UCU         Unconstrained                Constrained            Unconstrained         $\Lambda_k\Lambda_k' + \Psi$          $K[PQ - Q(Q-1)/2] + P$
UCC         Unconstrained                Constrained            Constrained           $\Lambda_k\Lambda_k' + \psi I_P$      $K[PQ - Q(Q-1)/2] + 1$
CUU         Constrained                  Unconstrained          Unconstrained         $\Lambda\Lambda' + \Psi_k$            $[PQ - Q(Q-1)/2] + KP$
CUC         Constrained                  Unconstrained          Constrained           $\Lambda\Lambda' + \psi_k I_P$        $[PQ - Q(Q-1)/2] + K$
CCU         Constrained                  Constrained            Unconstrained         $\Lambda\Lambda' + \Psi$              $[PQ - Q(Q-1)/2] + P$
CCC         Constrained                  Constrained            Constrained           $\Lambda\Lambda' + \psi I_P$          $[PQ - Q(Q-1)/2] + 1$
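The parameter counts in Table 1 follow a simple pattern that can be written as a short function. This is only a sketch: the function name is hypothetical, and the three letters of the identifier are read, as in the table, as the constraint on the loadings, on the error variances and on isotropy.

```python
def n_cov_params(identifier, P, Q, K):
    """Free covariance parameters for a three-letter identifier from Table 1.

    Letters refer, in order, to loadings, error variances and isotropy;
    'U' means unconstrained and 'C' constrained.
    """
    loadings, noise, isotropic = identifier
    per_loading = P * Q - Q * (Q - 1) // 2           # PQ - Q(Q-1)/2
    n_loading = K * per_loading if loadings == "U" else per_loading
    n_noise_matrices = K if noise == "U" else 1      # state-specific or common
    per_noise = P if isotropic == "U" else 1         # diagonal or scalar * I_P
    return n_loading + n_noise_matrices * per_noise


# Hypothetical dimensions: P = 10 variables, Q = 2 factors, K = 3 states.
models = ["UUU", "UUC", "UCU", "UCC", "CUU", "CUC", "CCU", "CCC"]
counts = {m: n_cov_params(m, 10, 2, 3) for m in models}
```

For comparison, three unrestricted $10 \times 10$ covariance matrices would require $3 \times 55 = 165$ parameters, so even the least constrained member of the family is considerably more parsimonious.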
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3 Maximum likelihood estimation 


Even in this relatively general framework, the parameters of the proposed
parsimonious HMMs can be estimated by maximum likelihood. To perform maximum
likelihood estimation of the above model on the basis of the multivariate response
$y_t = (y_{t1},\dots,y_{tP})$, computation of the likelihood function

\[ L(\theta) = \pi' f(y_1)\, {}_2\Pi\, f(y_2)\, {}_3\Pi \cdots f(y_{T-1})\, {}_T\Pi\, f(y_T)\, 1 \tag{7} \]

is necessary. Here, $\theta = \{\mu_k, \Lambda_k, \Psi_k, \gamma_{jk}, \pi_k, k = 1,2,\dots,K\}$
is the set of all model parameters, $f(y_t)$ denotes a diagonal matrix with the
conditional probability densities $f(Y_t = y_t \mid S_t = k; \mu_k, \Lambda_k, \Psi_k)$
on the main diagonal, and $1$ represents a vector of ones of size $K$.
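The matrix product in (7) can be evaluated from left to right without enumerating the hidden state sequences. The sketch below does so with rescaling at each step to avoid numerical underflow, which is a common implementation device and an assumption here rather than something stated in the text. The function name and the input layout are hypothetical; the state-dependent densities are taken as already evaluated.

```python
import math


def hmm_log_likelihood(pi, trans, dens):
    """log of pi' f(y_1) Pi_2 f(y_2) ... Pi_T f(y_T) 1.

    pi:    initial probabilities, length K
    trans: list of T - 1 matrices; trans[t - 1][j][k] is the probability of
           moving from state j at position t - 1 to state k at position t
           (positions counted from 0)
    dens:  dens[t][k] = density of y_t in state k, i.e. the diagonal of f(y_t)
    """
    K = len(pi)
    phi = [pi[k] * dens[0][k] for k in range(K)]  # pi' f(y_1)
    log_lik = 0.0
    for t in range(len(dens)):
        if t > 0:
            # Multiply by the transition matrix, then by the diagonal f(y_t).
            phi = [sum(phi[j] * trans[t - 1][j][k] for j in range(K)) * dens[t][k]
                   for k in range(K)]
        scale = sum(phi)  # rescale so that phi sums to one
        log_lik += math.log(scale)
        phi = [p / scale for p in phi]
    return log_lik


# Hypothetical two-state example with T = 3 and a time-homogeneous matrix.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
dens = [[0.5, 0.1], [0.4, 0.3], [0.2, 0.6]]
log_lik = hmm_log_likelihood(pi, [A, A], dens)
```

Passing a different matrix for each time point in `trans` covers the time-varying case, while repeating the same matrix gives the homogeneous HMM.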

To maximize (7) with respect to $\theta$, we introduce a three-step AECM-based
procedure consisting of the following steps:


Step 1. Fit a homogeneous HMM, i.e. without covariates, for the multivariate
continuous outcomes. Maximum likelihood estimation is performed by maximizing
(7) under the constraint ${}_t\Pi = \Pi$ using an AECM algorithm. The motivation
behind the use of the AECM algorithm lies in its ability to break the model into
smaller models. On the basis of this preliminary fitting, we obtain the final
estimates of the conditional distribution parameters.

Step 2. For each time $t = 1,\dots,T$, we obtain the posterior expected values of
state membership on the basis of the first step.

Step 3. Maximize the component of the (complete-data log-) likelihood involving
the hidden structure parameters.


After an initial estimate of the latent parameters, the second and the third steps
are iterated until convergence, while keeping fixed the estimates of the conditional
distribution parameters from the first step.

At the first step of the algorithm, we partition the set of unknown parameters
$\theta$ into two disjoint subsets $(\theta_1, \theta_2)$: $\theta_1$ contains the
hidden chain parameters $\pi$ and $\Pi$ and the elements of the state-specific means
$\mu_k$, while $\theta_2$ consists of $\Lambda_k$ and $\Psi_k$. Then, the following
steps are alternated until convergence in order to carry out the AECM algorithm:


• First stage


E-step: compute the conditional expectation of the complete-data log-likelihood,
given the observed data and the current estimate of the parameter vector
$(\mu, \pi, \Pi)$, while keeping $(\Lambda, \Psi)$ fixed at their values resulting
from the previous iteration.

M-step: maximize the preceding expected complete-data log-likelihood function
with respect to $(\mu, \pi, \Pi)$.


• Second stage

E-step: compute the conditional expectation of the complete-data log-likelihood,
in this step conditional on $(\Lambda, \Psi)$, while considering $(\mu, \pi, \Pi)$
fixed as given by the calculations in the first stage of the AECM algorithm.

CM-step: maximize the preceding expected complete-data log-likelihood function
with respect to $(\Lambda, \Psi)$. The (conditional) maximization step depends on
the imposed model restrictions.


Once the AECM achieves convergence at the first step, we obtain the posterior
probabilities of belonging to a state and use these to get estimates of
${}_t\pi_{k|j}$. The estimated parameters for the hidden process are the solutions of
weighted sums of $K$ multinomial regressions. We then update the posterior
probabilities and iterate Step 2 and Step 3, plugging the estimated transition
probabilities into the log-likelihood function, until convergence.
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A flexible analysis of PISA 2015 data across 
countries, by means of multilevel trees and 
boosting. 


Comparative analysis of PISA 2015 results:
an application of mixed-effects trees and boosting.


Chiara Masci, Geraint Johnes and Tommaso Agasisti 


Abstract The aim of this work is to analyze and compare PISA 2015 results in
mathematics in nine countries, finding out which student- and school-level
characteristics are related to students' performances. Since education systems differ
across countries, the main methodological issue is to use flexible methods that do
not force any functional relationship between the variables. We therefore apply
tree-based methods in a two-stage procedure: in the first stage, random effect
regression trees are used to relate student performances to students' characteristics
and to estimate school value-added; in the second stage, school value-added is
related to school level characteristics by means of regression trees and boosting.
Results show that tree-based methods fit the problem well, being able to explain a
good part of the variability and identifying different significant features across
countries.

Abstract The aim of this work is to analyse and compare the results of the PISA
2015 mathematics test in nine countries of the world, identifying which student- and
school-level characteristics are most strongly related to student performance. Given
the differences among school systems around the world, the methodological aim of
the analysis is to find a method flexible enough not to force any relationship
between the variables. We therefore apply regression-tree-based methods in a
two-stage procedure: in the first stage, we apply multilevel regression trees to
identify the student-level variables related to student results and to estimate the
school effect; in the second, we identify the school-level variables related to the
school effect, using regression trees and boosting. The results show that tree-based
techniques are suited to analysing this type of data and reveal how the
characteristics related to student performance differ across countries.


Chiara Masci 
Politecnico di Milano, via Bonardi 9, Milano 20133, e-mail: chiara.masci@polimi.it


Geraint Johnes 
Lancaster University, Lancaster, LA1 4YX, e-mail: g.johnes@lancaster.ac.uk




Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 


Chiara Masci, Geraint Johnes and Tommaso Agasisti


Key words: Random effect regression trees, boosting, student achievements, school 
value-added. 


1 Introduction 


The educational system is a complex and largely unknown process that varies across
and within countries. The determinants that play a role in this process are various
and arise from different levels of the scholastic system and of students' lives.
Indeed, the learning process of students is influenced not only by students' own
characteristics, but also by the family, the peers, the context in which they live,
their class/schoolmates, and the characteristics of the school that they attend.
When trying to analyze the educational process, it is worthwhile but difficult to
take all these aspects into account, especially because their marginal impact on
student achievements is unknown and the interactions between them further complicate
the process itself.

Our aim is to identify which student and school level characteristics are related to
student achievements, and to investigate their impact on the outcome and how these
impacts interact, within nine countries¹. In particular, our research questions are:


• Which student level characteristics are related to student achievements?

• How much of the total variability between student achievements can be explained
by the difference between schools, and how can we estimate the school value-added?

• Which school level characteristics are related to school value-added, and in which
way?

• How do the important variables interact in influencing the outcome variable?

• How do these relationships vary across countries?


In order to address these issues, we develop a two-stage analysis: (i) in the first
stage, we apply a random effects tree-based estimation method, called RE-EM tree
(see Sela and Simonoff (2012)), in which we consider students (level 1) nested within
schools (level 2); by means of this model we can both identify the student level
variables that are related to student achievements and estimate the school
value-added; (ii) in the second stage, we apply regression trees (see Gareth et al.
(2013)) and boosting (see Elith et al. (2013)) to identify which school level
characteristics are related to the school value-added (estimated at stage (i)), how
they are related to the outcome and how they interact with each other.


1 The 9 selected countries are: Australia, Canada, France, Germany, Italy, Japan, Spain, UK, USA. 
2 The Dataset


The Programme for International Student Assessment (PISA) is a triennial
international survey (started in 2000) which aims to evaluate education systems
worldwide by testing the skills and knowledge of 15-year-old students (see Pena-Lopez
(2016)). Students are assessed in various disciplines, and a set of student and school
level characteristics is available thanks to questionnaires filled out by students
and school principals. In our analysis, we use the 2015 PISA data for nine countries:
Australia, Canada, France, Germany, Italy, Japan, Spain, UK and USA.
At the student level, we consider information about gender, immigrant status,
socio-economic index, time spent studying, some constructed indicators of the
student's approach to school and to the subject (anxiety, effort, collaboration,
perception of school climate, ...) and information about the family (home resources,
support, ...). At the school level, we consider information about the composition of
the student body (school size, percentage of disadvantaged students, ...), school
resources (computers, number of teachers, materials, ...), “management” (principal
characteristics, funds, ...), school climate (student truancy, teacher absenteeism,
...) and the participation of students’ families.


3 Methodology 


There are three main points to be taken into account when modeling this kind of
educational data: the grouping levels of the data (students within schools), realistic
assumptions on the relationships across variables, and interactions.

This is why we move to a machine learning approach, applying a two-stage procedure.
In the first stage, we apply a (two-level) random effect regression tree (RE-EM tree)
with random intercept. The response variable is the student (level 1) PISA test score
in maths, which is regressed against a set of student level characteristics (fixed
effects), with students nested within schools (level 2). By means of this model, we
can estimate the fixed effects of student level predictors on the outcome as well as
the school value-added. In the second stage, we regress the estimated school
value-added against a set of school level characteristics, by means of regression
trees and boosting.


3.1 First stage: RE-EM trees 


RE-EM trees work essentially as random effects linear models (see Snijders (2011)),
but relax the assumption of a linear relationship between the fixed covariates and
the response: a regression tree is built for the fixed part instead. If we consider
students (level 1) nested within schools (level 2), the model takes the form:
\[ y_{ij} = f(x_{ij1}, \ldots, x_{ijp}) + b_j + \epsilon_{ij} \tag{1} \]
with
\[ b_j \sim N(0, \sigma_b^2), \tag{2} \]
\[ \epsilon_{ij} \sim N(0, \sigma_\epsilon^2), \tag{3} \]
where $f(x_{ij1}, \ldots, x_{ijp})$ involves a partition of the predictor space and


$y_{ij}$ is the maths PISA test score of student $i$ within school $j$;
$x_{ij1}, \ldots, x_{ijp}$ are the $p$ predictors at student level;

$b_j$ is the random effect of school $j$;

$\epsilon_{ij}$ is the error.


One of the advantages of multilevel models is that we can compute the Percentage
of Variability explained by Random Effects (PVRE):

\[ \mathrm{PVRE} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_\epsilon^2} \tag{4} \]
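
As a quick arithmetic check of the PVRE definition, the short Python sketch below computes the ratio from two variance components. The function name `pvre` is ours, and the example inputs are the France variance estimates as read from Table 1, under the assumption that the first variance column is the residual variance and the second the between-school variance.

```python
def pvre(sigma_b2: float, sigma_eps2: float) -> float:
    """Share of total variance attributable to the random (school) effect."""
    total = sigma_b2 + sigma_eps2
    if total <= 0:
        raise ValueError("total variance must be positive")
    return sigma_b2 / total


# France, values as read from Table 1 (column order is an assumption).
france = pvre(sigma_b2=0.419, sigma_eps2=0.464)
print(f"PVRE for France: {france:.2%}")
```

The result is close to the value reported in Table 1; the small gap presumably reflects rounding of the published variance estimates.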


3.2 Second stage: Boosting 


Regression trees have a series of advantages: they do not force any functional
relationship between the response variable and the covariates; they can be displayed
graphically and are easily interpretable; they can handle qualitative predictors; and
they allow interactions between the variables. Nevertheless, they suffer from high
variance and are very sensitive to outliers. For these reasons, there are methods
that reduce variance and increase predictive power, such as Bagging, Random Forest
and Boosting (see Gareth et al. (2013)).

Boosting (see Elith et al. (2013)) is a method for improving model accuracy, based
on the idea that it is easier to find and average many rough rules of thumb than to
find a single, highly accurate prediction rule. Related techniques, including bagging,
stacking and model averaging, also build multiple models and then merge their results,
but boosting is unique because it is sequential: it is a forward, stagewise procedure.
In boosting, models (e.g. regression trees) are fitted iteratively to the training
data, using appropriate methods to gradually increase the emphasis on observations
modeled poorly by the existing collection of trees.
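
The sequential, stagewise idea can be illustrated with a minimal sketch in Python. It implements least-squares boosting with one-split trees (stumps) on a single predictor: each new stump is fitted to the residuals of the current ensemble, which is one common way of shifting emphasis towards poorly modeled observations. This is not necessarily the algorithm or software used in the paper; all function names, the toy data and the tuning values (`n_trees`, `learning_rate`) are illustrative assumptions.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Forward stagewise fit: each stump models the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((thr, left_mean, right_mean))
        pred = [pi + learning_rate * (left_mean if xi <= thr else right_mean)
                for xi, pi in zip(x, pred)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, xi):
    """Sum the base value and the shrunken contribution of every stump."""
    out = model["base"]
    for thr, left_mean, right_mean in model["stumps"]:
        out += model["learning_rate"] * (left_mean if xi <= thr else right_mean)
    return out


def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


# Toy data (hypothetical): a noisy step function of one predictor.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 3.1, 2.9, 3.2, 3.0]
model = boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
print(baseline_mse, boosted_mse)
```

The small learning rate means each stump contributes only a fraction of its fit, so many weak rules are accumulated rather than one strong rule being fitted at once.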


4 Results 


Table 1 shows the RE-EM tree results. For each country, we obtain the proportion of
variability explained by the RE-EM tree model (R²), the PVRE, the tree of fixed
effects and the estimated school value-added. The R² values are relatively high in
almost all the countries, suggesting that the model is able to capture a good part of
the variability in the data. The PVREs are quite different across countries, meaning
that there are countries (e.g. France or Japan) where differences across schools are
quite big, and others (e.g. Spain or Australia) where the impact of attending certain
schools rather than others is small.


Country     σ²_ε    σ²_b    PVRE     R²

Australia   0.690   0.125   15.41%   33.59%
Canada      0.724   0.143   16.49%   29.93%
France      0.464   0.419   47.47%   55.28%
Germany     0.525   0.437   45.44%   50.17%
Italy       0.568   0.395   41.04%   45.57%
Japan       0.510   0.437   46.13%   50.32%
Spain       0.706   0.068   0.08%    30.11%
UK          0.695   0.162   18.97%   32.51%
USA         0.689   0.132   16.15%   33.45%


Table 1 RE-EM trees results in the nine selected countries.


The results of the second-stage regression tree boosting may be summarized, in each
country, by (i) the variable importance ranking (boosting gives an idea of how
“important” each school level variable is in explaining the school value-added);
(ii) single and joint partial plots (a partial plot gives a graphical representation
of the marginal effect of a predictor on the response variable, after “averaging out”
the effects of the other predictors; joint plots represent the joint impact of two
predictors on the response); (iii) the percentage of variability explained by the
model.

As an example, in Australia we are able to explain about 40.2% of the total
variability, and the four most important variables turn out to be: percentage of
disadvantaged students, percentage of funds given by the government, student truancy
and percentage of students with special needs. Figure 1 reports the partial plots of
these four most important variables and Figure 2 reports an example of two joint
plots.
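
The “averaging out” behind a partial plot can be sketched in a few lines of Python. For each value on a grid, the chosen predictor is set to that value in every observation, the model is evaluated, and the predictions are averaged. The helper name `partial_dependence`, the toy prediction function and the feature names are hypothetical and stand in for a fitted boosting model and the real school level variables.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data are untouched
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Toy stand-in for a fitted model (hypothetical functional form).
def toy_model(row):
    return -2.0 * row["pct_disadvantaged"] + 1.0 * row["truancy"]


# Hypothetical school-level observations.
schools = [
    {"pct_disadvantaged": 0.1, "truancy": 0.0},
    {"pct_disadvantaged": 0.4, "truancy": 1.0},
    {"pct_disadvantaged": 0.7, "truancy": 2.0},
]

curve = partial_dependence(toy_model, schools, "pct_disadvantaged", [0.0, 0.5, 1.0])
print(curve)
```

A joint partial plot follows the same logic with two predictors fixed at each point of a two-dimensional grid.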


5 Conclusions 


This paper analyzes PISA 2015 test scores in mathematics in nine countries by means
of a flexible approach that is able to fit different education systems and to
identify different patterns within the data. The methodology takes into account the
hierarchical structure of the data, does not force any functional relationship
between variables and allows for interactions between them. Results show the high
predictive power of tree-based methods in this context, identifying the variables
most significantly related to students’ performance and highlighting heterogeneities
across countries.
Fig. 2 Joint Partial plots in Australia. Color identifies the values of school value-added.


Impact of the 2008 and 2012 financial crises on the unemployment rate in Italy:
an interrupted time series approach


Impatto delle crisi finanziarie del 2008 e del 2012 sul tasso di disoccupazione in
Italia: approccio di analisi basato sulle serie temporali interrotte


Lucio Masserini and Matilde Bini 


Abstract 

One of the most widely recognized indicators of a recession is a rising
unemployment rate. In Italy, this indicator decreased continuously from the late
nineties until 2007. The aim of this paper is to study the immediate
impact and persistence of the 2008 global financial crisis and the 2012 European
sovereign debt crisis on the Italian unemployment rate by using a segmented
regression analysis approach of interrupted time series. Quarterly data were collected
from the website of the Italian National Institute of Statistics. In particular, the
impact of the financial crises was evaluated across some subpopulations of interest
by stratifying the unemployment rate by age group (in order to highlight the effects
on youth unemployment), gender and macro-region. Finally, to provide a more in-depth
analysis, some information on the effects of the two economic recessions is also
given for young people not in education, employment or training.
Abstract Uno degli indicatori di recessione più utilizzati è il tasso di 
disoccupazione. In Italia, dalla fine degli anni novanta tale indicatore è 
costantemente diminuito fino al 2007. Lo scopo di questo lavoro è quello di studiare 
l'impatto immediato e la persistenza della crisi finanziaria globale del 2008 e la
crisi del debito sovrano europeo 2012 sul tasso di disoccupazione italiano, 
utilizzando un'analisi di regressione di serie temporali interrotte. I dati sono stati 
raccolti sul sito dell'Istituto Nazionale Italiano di Statistica. In particolare, l'impatto 
delle crisi finanziarie è stato valutato per alcune sottopopolazioni di interesse 
stratificando il tasso di disoccupazione per età, al fine di evidenziare gli effetti sulla 
disoccupazione giovanile, per genere e per macro-regioni. Infine, per fornire una 
descrizione più approfondita del fenomeno, l’analisi è stata estesa anche ai giovani 
non occupati e non in istruzione e formazione. 


Key words: unemployment rate, interrupted time series analysis, segmented 
regression 


1 Introduction 


During the past decade, two economic crises had a severe impact on countries all
around the world. More specifically, after the economic decline observed in world
markets during the late 2000s and early 2010s, which generated the Great Recession,
defined by the International Monetary Fund as the worst global recession since the
Great Depression of the 1930s, a sovereign debt crisis hit European countries in
2009, resulting in a second economic recession in the following years (2011-2015).
These crises produced negative effects on GDP growth, on economic performance,
on labour productivity and on labour markets. The International Labour
Organization (ILO, 2001) revealed that, due to the global economic crisis, in 2009
about 22 million people were unemployed worldwide, in particular in developed
economies and in the European Union. During this period the unemployment rate
continued to increase dramatically, with high and persistent levels of
unemployment among young people. In Italy the unemployment problem is worrying
since it affects particular segments of the labour market, such as the younger
generations and some macro-regions.

The aim of this study is to assess and measure whether and how much the
aforementioned financial crises have changed the level and trend in the unemployment
rate (UR) and in the percentage of young people who are neither employed nor in
education or training (NEET), immediately and over time, and to see whether these
changes are short- or long-term. Whereas UR is a widely recognized indicator of a
recession, the NEET has been considered since it provides a measure of disengagement
from the labour market and perhaps, more generally, also quantifies how many people
are sliding towards the margins of active society. A segmented regression approach of
Interrupted Time Series (ITS) analysis is applied to quarterly data collected from
the Italian National Institute of Statistics (ISTAT).
2 Data and empirical strategy


Data were collected from I.Stat, the warehouse of statistics currently produced by the
Italian National Institute of Statistics (ISTAT), which provides an archive of about
1,500 time series (http://dati.istat.it/). Quarterly data on two different kinds of
indicators were downloaded from the theme ‘Labour and wages’: UR for the period
1993-2016, overall and stratified by gender, age group and macro-region; and the
percentage of NEET for the period 2004-2016, overall and stratified by gender.
These data are derived from the official estimates obtained in the Labour Force
Survey, carried out on a quarterly basis by interviewing a sample of nearly 77,000
households representing 175,000 individuals. According to the Eurostat definition
(Eurostat, 2017), UR is the number of unemployed people as a percentage
of the labour force. The youth unemployment rate (YUR) is the number of
unemployed 15-24 year-olds expressed as a percentage of the youth labour force,
and the NEET refers to the percentage of people aged between 15 and 29 years who
currently do not have a job, are not enrolled in training and are not classified as
students. Figure 1a illustrates the trend in the overall UR in Italy from 1993 to 2016.
The choice of such a long period allows for a more accurate estimate of the secular
trend, which will be useful for the subsequent analysis. As shown, UR rises from
the mid-nineties until 1998, when it reaches 11.6%. Thereafter, it steadily declines
until the third quarter of 2007 (2007q3), falling to 5.7%, which is the
minimum value observed throughout the period. Starting from the fourth quarter of
2007 (2007q4), the period in which the effects of the financial crisis following the
bankruptcy of Lehman Brothers begin to appear, UR undergoes a first shock. As a
result, it shows a trend reversal, although with some obvious fluctuations, and its
value rises in the subsequent two-year period, known as the Great Recession,
oscillating between 7% and 9% in 2010-2011, when it reaches roughly the same level
as a decade before. Afterwards, the European sovereign debt crisis, which occurred in
late 2011 (2011q4), causes a second shock, and UR increases even more dramatically,
up to 13.5% at the end of 2014 (2014q4), after a six-quarter recession for the euro
area economy. After this peak, UR seems to show a slight trend reversal, although it
is perhaps still too early to consider this a possible structural change.


Figure 1: Total UR (a) and percentage of NEET (b) in Italy over the observed period (horizontal axis: quarter)
On the other hand, Figure 1b illustrates the trend of the overall percentage of NEET
in Italy from 2004 to 2016. For this indicator, the series is shorter because data from
previous years are not available. However, the trend can still be detected and it
seems, at least partly, similar to that of UR. Indeed, after a slight decrease in the
period before the onset of the global financial crisis (2007q3), the percentage of
NEET starts a steep and steady growth that continues unchanged even after the
occurrence of the sovereign debt crisis (2011q3). Here too, a trend reversal
seems to occur starting from the end of 2014 (2014q4). In the light of the previous
considerations, the analysis period was divided into the following four sub-periods,
arising from the two successive financial shocks and from the slight trend reversal
observed in the last two years: the period before the so-called 2008 global financial
crisis (until 2007q3); the subsequent three-year period known as the Great Recession,
the aftermath of the financial crisis, characterised by a general economic decline
observed in world markets (2007q4-2011q3); the period following the European
sovereign debt crisis, which resulted in a second economic recession
(2011q4-2014q4); and finally, the last two years (2015q1-2016q4), during which a
slight decrease in both the UR and the percentage of NEET can be glimpsed.
Consequently, the three breaks in the series were set at 2007q4, 2011q4 and 2015q1.
Moreover, as regards the UR, since the historical trend changed substantially in the
late nineties (see Figure 1a), data prior to 1999 were removed from the analysis in
order to obtain a more accurate estimate of the underlying trend before the first
interruption of the series. Therefore, the analysis period is limited to
1999q1-2016q4 for the UR and to 2004q1-2016q4 for the percentage of NEET. The
interruptions make it possible to highlight the severity of each of the two financial
crises and the continuation of their effects in the subsequent years of recession, as
shown by the sharp change observed in UR at the beginning of each period and by the
subsequent trend.


3 Interrupted time series analysis 


In this study, a segmented regression approach of interrupted time series (ITS) 
analysis was carried out in order to assess and measure, in statistical terms, whether 
and how much the two financial crises have changed the level and trend in the 
outcome variables, immediately and over time, and to see if these changes are short- 
or long-term (Wegner, Soumerai and Zhang, 2002). 

ITS analysis (Shadish, Cook and Campbell, 2002) is a simple but powerful tool
used in quasi-experimental designs for estimating the impact of population-level or
large-scale interventions on an outcome variable observed at regular intervals before
and after the intervention. In such circumstances, ITS makes it possible to examine
any change in the outcome variable in the post-intervention period given the trend in
the pre-intervention period (Bernal, Cummins and Gasparrini, 2016). In this respect,
the underlying secular trend in the outcome before the intervention is determined and
used to estimate the counterfactual scenario, which represents what would have
happened if the intervention had not taken place and serves as the basis for
comparison. For the purposes of our study, the interventions are two
unplanned real-world events, namely the aforementioned and well-recognized financial
crises. In segmented regression of ITS, each sub-period of the series is allowed to
exhibit its own level and trend, which can be represented by the intercept and slope
of a regression model, respectively. The intercept indicates the value of the series at
the beginning of an observation sub-period, and the slope is the rate of change
during a segment (or sub-period). By following this approach, it is therefore possible
to compare the pre-crisis level and trend with their post-crisis counterparts, in
order to estimate the magnitude and statistical significance of any differences.

The ITS regression model with a single group under study (here, the Italian
population), two interventions (the two economic recessions starting in 2007q4 and
2011q4) and a possible UR trend reversal in 2015q1 can be represented as follows
(Linden and Adams, 2011; Bernal, Cummins and Gasparrini, 2016):

\[
y_t = \beta_0 + \beta_1 T_t
    + \beta_2 x_{t,\mathrm{2007q4}} + \beta_3 T_{t,\mathrm{2007q4}} x_{t,\mathrm{2007q4}}
    + \beta_4 x_{t,\mathrm{2011q4}} + \beta_5 T_{t,\mathrm{2011q4}} x_{t,\mathrm{2011q4}}
    + \beta_6 x_{t,\mathrm{2015q1}} + \beta_7 T_{t,\mathrm{2015q1}} x_{t,\mathrm{2015q1}}
    + \epsilon_t
\]

In particular, $y_t$ is the aggregated outcome variable at each equally spaced time
point $t$, here represented by quarters; $T_t$ is the time elapsed since the start of
the study, where $t$ varies from 1999q1 to 2016q4 for UR and from 2004q1 to
2016q4 for NEET; $x_{t,\mathrm{2007q4}}$ is a dummy variable indicating the onset of
the global financial crisis in the fourth quarter of 2007, coded as 0 (pre-crisis
period) and 1 (post-crisis period); $T_{t,\mathrm{2007q4}} x_{t,\mathrm{2007q4}}$ is the
interaction term between time and the 2007q4 global financial crisis;
$x_{t,\mathrm{2011q4}}$ is a dummy variable indicating the onset of the 2011q4
European sovereign debt crisis, coded as 0 (pre-crisis period) and 1 (post-crisis
period); and $T_{t,\mathrm{2011q4}} x_{t,\mathrm{2011q4}}$ is the interaction term
between time and the 2011q4 European sovereign debt crisis. Finally,
$x_{t,\mathrm{2015q1}}$ is a dummy variable indicating the time at which a trend
reversal might have occurred, coded as 0 (before the trend reversal) and 1 (after the
trend reversal); and $T_{t,\mathrm{2015q1}} x_{t,\mathrm{2015q1}}$ is the corresponding
interaction term. Accordingly, $\beta_0$ is the intercept and represents the starting
level of the outcome variable at T = 1999q1 for UR and T = 2004q1 for NEET;
$\beta_1$ is the slope and represents the trajectory (or secular trend) of the
outcome variable until the 2007q4 global financial crisis; $\beta_2$ is the level
change that occurs immediately following the 2007q4 global financial crisis (compared
to the counterfactual); $\beta_3$ is the difference between the slopes before and
after the global financial crisis; $\beta_4$ is the level change that occurs
immediately following the 2011q4 European sovereign debt crisis; $\beta_5$ is the
difference between the slopes before and after the European sovereign debt crisis;
$\beta_6$ is the level change that occurs immediately following 2014q4 (compared to
the counterfactual); $\beta_7$ is the difference between the slopes before and after
the trend reversal; and $\epsilon_t$ represents the random error term, which is
assumed to follow a first-order autoregressive (AR(1)) process. The regression
coefficients are estimated by ordinary least squares (OLS) with Newey-West
(1987) standard errors.
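
To make the coding of the regressors concrete, the following Python sketch builds one row of the design matrix for a given quarter. It assumes that time is counted in quarters from the start of the series (so the first quarter has T = 0) and that each interaction term uses the time elapsed since the corresponding break, which is the coding under which the dummy coefficient can be read as an immediate level change. The helper names and the coefficient values in the example are hypothetical, not the authors' code or estimates.

```python
BREAKS = ["2007q4", "2011q4", "2015q1"]


def quarter_index(label, start="1999q1"):
    """Number of quarters elapsed between `start` and `label` (e.g. '2007q4')."""
    year, quarter = label.split("q")
    year0, quarter0 = start.split("q")
    return (int(year) - int(year0)) * 4 + (int(quarter) - int(quarter0))


def design_row(label, start="1999q1", breaks=BREAKS):
    """Regressors [1, T, x_1, T_1*x_1, x_2, T_2*x_2, x_3, T_3*x_3] for one quarter."""
    t = quarter_index(label, start)
    row = [1.0, float(t)]
    for b in breaks:
        tb = quarter_index(b, start)
        x = 1.0 if t >= tb else 0.0   # dummy: 0 before the break, 1 from it onwards
        row += [x, x * (t - tb)]      # level change and slope change regressors
    return row


def fitted(label, beta, start="1999q1", breaks=BREAKS):
    """Fitted value for one quarter given a coefficient vector beta_0..beta_7."""
    return sum(b * r for b, r in zip(beta, design_row(label, start, breaks)))


# Hypothetical coefficients, for illustration only.
beta = [10.0, -0.1, 0.8, 0.25, 1.5, 0.1, -0.5, -0.4]

row_2008q4 = design_row("2008q4")
# Counterfactual at 2007q4: pre-crisis level and trend only.
counterfactual = beta[0] + beta[1] * quarter_index("2007q4")
jump = fitted("2007q4", beta) - counterfactual
print(row_2008q4, jump)
```

For the NEET series the same functions can be called with `start="2004q1"`. The rows produced this way could then be passed to any OLS routine; the standard-error correction is not covered by this sketch.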
4 Results


Four periods of linear trend were considered to analyse UR and NEET, with
interruptions at 2007q4, 2011q4 and 2015q1, respectively. Separate segmented
regression models were then estimated for age groups, gender and macro-regions via
ordinary least squares with Newey-West standard errors in order to handle lag-one
autocorrelation. To check the autocorrelation structure, the Cumby-Huizinga test
(Cumby and Huizinga, 1992) was performed, and the results confirm that
autocorrelation was present at lag 1 but not at higher orders (up to 9 lags were
tested). Results are shown in Table 1 for the UR and in Table 2 for the NEET.
For the purposes of this study, only the coefficients $\beta_0$ to $\beta_5$, which
summarize the level and trend of the dependent variables before and after the two
financial crises, are discussed. The interruption at 2015q1 was introduced in
order to obtain a more accurate estimate of the trend in the previous period, and
hence a proper assessment of rate and trend changes.

As regards the UR, the 1999 base rate showed some variability across the considered
sub-groups. Starting from 11.020 at national level, its value was considerably higher
for the age group 15-24 (26.909) and for the South macro-region (20.683), but lower
for the North East (4.640) and North West (5.927) macro-regions, as well as for males
(8.353) and for the 45-54 age group. Moreover, its trend prior to the 2008 global
financial crisis (1999q1-2007q3) showed a significant and general decrease, both at
national level (-0.138; p < 0.001) and for the different age groups, macro-regions
and genders. This reduction was more pronounced for the sub-groups traditionally
considered the most vulnerable, namely the South macro-region (-0.269; p < 0.001),
females (-0.205; p < 0.001) and YUR (-0.183; p < 0.001). The onset of the global
financial crisis (2007q4) caused an immediate and substantial UR increase at national
level (+0.788; p < 0.05) and in almost all the considered sub-groups, but no
significant change was detected for younger people (age groups 15-24 and 25-34) and
the North East macro-region. In particular, the most severe direct consequences were
observed among females (+1.061; p < 0.001), for people in the central (+1.061;
p < 0.001) and southern regions (+0.992; p < 0.05) and for the intermediate age group
35-44 (+0.997; p < 0.001). The aftermath of the financial crisis was quite strong
and resulted in the Great Recession in the subsequent years, during which a
substantial and significant trend change was observed compared to the previous period
(+0.255; p < 0.001). In this case, however, the most serious consequences occurred
for YUR (+0.716; p < 0.001) and, to a much lesser extent, for the South
macro-region (+0.371; p < 0.001). On the other hand, the immediate consequences
of the second financial crisis, following the European sovereign debt crisis (2011q4),
were even stronger than those of the previous financial crisis and resulted in a
second economic recession, with a UR increase almost double at national level
(+1.583; p < 0.001). This increase was higher for YUR (+3.696; p < 0.001) and for
the South macro-region (+2.634; p < 0.001), while there was again no significant
increase for the North East macro-region.


Table 1: Estimates of the impact of the 2007q4 and 2011q4 financial crises on the UR in Italy


Columns: group; base rate (1999); trend 1999q1-2007q3; rate change 2007q4; trend change 2007q4; rate change 2011q4; trend change 2011q4; rate change 2015q1; trend change 2015q1
Overall 11.020*** -0.138*** 0.788"* 0.255** 1.583*** 0.137 -0.492 -0.425*** 
Males 8.352" -0.095*** 0.602* 0.250*** 1.459°™* 0.112 -1.303 -0.428""" 
Females 14.989*** -0.205*** 1.061*** 0.267""* 1.653°™* 0.190%" -1.638"" -0.280* 
15-24 26.909"** -0.183** 0.970 0.716" 3.696% 0.345 -3.350 -1.611% 
25-34 11.142*** -0.068*** 0.127 0.290*** 1.566% 0.246™* -1.873 -0.588""" 
35-44 7.863*"* -0.095*** 0.997"** 0.177" 1.245°** 0.166% -1.333"" -0.245* 
45-54 6.505*** -0.099*** 0.725** 0.196"** 111" 0.109* -0.748 -0.296*** 

55-64 8.076"** -0.166% 0.847" 0.219*** 1.350*** -0.005 -0.167 -0.029 
Northwest 5.927" -0.067% 0.826" 0.222™** 0.888"" 0.014 -0.895 -0.307"" 
North 4.640" -0.038""" 0.244 0.171" 0.842 0.011 -0.637 -0.306*** 
Center 8.736*** -0.098*** 1.061*** 0.193** 1.319* 0.140* -1.022 -0.356"" 
South 20.683** -0.269*"* 0.992"* 0.371* 2.634" 0.360%" -2.445% -0.478""" 

Table 2: Estimates of the impact of the 2007q4 and 2011q4 financial crises on the percentage of NEET in Italy

Columns: group; base rate (2004); trend 2004q1-2007q3; rate change 2007q4; trend change 2007q4; rate change 2011q4; trend change 2011q4; rate change 2015q1; trend change 2015q1
Overall 19.973*** -0.061 -0.358 0.352""* 0.006 0.032 -1.673"" -0.542*" 
Males 15.148** 0.013 -0.410 0.361 0.309 0.004 -1.507% -0.744"" 
Females 24.713** -0.127* -0.346 0.332** -0.302 0.062 -1.841™ -0.331™" 
Northwest 12.639°** -0.078 0.607 0.400*** -1.161 0.067 -1.734* -0.634"" 
North East 10.521°** -0.008 0.521 0.377" 0.113 -0.144 -0.791 -0.653"" 
Center 15.479°** -0.089*** -1.243 0.477" 0.157 -0.055 -1.808"* -0.538"" 
South 29.884""* -0.074 0.385 0.297" 0.549 0.132* -2.001™* -0.445* 


* p < 0.10; ** p < 0.05; *** p < 0.01
After this second financial shock, the UR seems to further accelerate its increase
only in some sub-groups, while at national level no significant trend difference was
observed. In particular, this acceleration was highest for the South
macro-region (+0.360; p < 0.001), the age group 25-34 (+0.246; p < 0.05) and females
(+0.190; p < 0.001), while no significant further increase was detected for YUR.
However, it should be emphasized that this further increase, although lower
than the one observed during the Great Recession, adds, where present, to the
already existing one, making the situation particularly critical. As regards the
percentage of NEET, considerable heterogeneity was found in the 2004 base rate,
which was 19.973 at national level. Its value was higher for the South macro-region
(29.884) but lower for the North East (10.521) and North West (12.639)
macro-regions; moreover, it was higher for females (24.713) than for males (15.148).
On the other hand, its trend prior to the 2008 global financial crisis
(2004q1-2007q3) was basically constant at national level, the only exception being
the Center macro-region, which showed a slightly descending trajectory (-0.089;
p < 0.001). The onset of the global financial crisis (2007q4) did not cause an
immediate impact on the percentage of NEET, either overall or in any of the
considered sub-groups. However, a significant trend change was found both at national
level (+0.352; p < 0.001) and for all the analysed sub-groups; this change was
highest for the Center macro-region (+0.477; p < 0.001). The European sovereign debt
crisis (2011q4) does not seem to alter this situation, either for the rate change or
for the trend change. This means that after the second financial crisis the
percentage of NEET keeps rising at the same steady pace as in the previous period,
without showing any jump.
An R Package for Cluster-Weighted Models


Un Pacchetto R per i Modelli Cluster-Weighted


Angelo Mazza and Antonio Punzo and Salvatore Ingrassia 


Abstract Cluster-weighted models (CWMs) are mixtures of regression models
with random covariates. However, despite having recently become rather
popular in statistics and data mining, CWMs still lack support within the
most popular statistical suites. In this paper, we introduce flexCWM, an R
package specifically conceived for fitting CWMs. The package supports
modeling the response variable, conditional on the covariates, by means of
the most common distributions of the exponential family and the t
distribution. Covariates are allowed to be of mixed type, and parsimonious
modeling of multivariate normal covariates, based on the eigenvalue
decomposition of the component covariance matrices, is supported. Furthermore,
either the response or the covariate distributions can be omitted, yielding
mixtures of distributions and mixtures of regression models with fixed
covariates, respectively. The expectation-maximization (EM) algorithm is used
to obtain maximum-likelihood estimates of the parameters, and likelihood-based
information criteria are adopted to select the number of groups and/or
the parsimonious model. For the component regression coefficients, standard
errors and significance tests are also provided. Parallel computation can be
used on multicore PCs and computer clusters when several models have to
be fitted.

Abstract Cluster-weighted models (CWMs) are mixtures of regressions with
random covariates that have become rather popular in recent years. Nevertheless,
the most common statistical software offers no support for such models.
To reduce this gap, in this work we introduce the R package
flexCWM, which allows a wide range of cluster-weighted models to be fitted.
In particular, the package supports the most common distributions
of the exponential family, as well as the t distribution, for
the response variable. Covariates may be of mixed type; for those
distributed according to a multivariate normal, parsimonious CWMs can be
considered through a well-known spectral decomposition of the covariance
matrices. Furthermore, either the response variable or the distribution of the
covariates can be omitted, defining, respectively, mixtures
of distributions and mixtures of regressions with fixed covariates. The EM algorithm
is used to obtain maximum-likelihood estimates of the parameters,
while likelihood-based model selection criteria are used
to choose the number of mixture components and/or the optimal parsimonious
configuration. For the estimates of the regression coefficients,
standard errors and the usual significance tests are computed.
Finally, the package allows several models to be fitted simultaneously using
parallel processing.


Key words: cluster-weighted models, EM algorithm, mixture models, model- 
based clustering, random covariates 


1 Introduction 


When the data at hand are composed of a response variable Y and a set
of d covariates X, say (X,Y), and there is a latent source of heterogeneity,
mixtures of regression models with fixed covariates (see, e.g., DeSarbo and
Cron, 1988 and Frühwirth-Schnatter, 2006) constitute a reference framework
of analysis. However, by assuming fixed covariates, no model for X is
considered; furthermore, the assignment of the data points (x, y) to the
clusters is required to be independent of the covariate distribution, as noted
by Hennig (2000). This assignment independence assumption is generally not
true in observational studies, and makes mixtures of regression models with
fixed covariates inadequate in many real data applications. Mixtures of
regression models with random covariates overcome this problem by allowing
for assignment dependence: the component distributions for X can also be
distinct and they can affect the assignment of the data points to the clusters.
Therefore, they are often to be preferred in real data analyses (Hennig, 2000).
For a comparison between the two approaches, see also Ingrassia et al. (2012)
and Ingrassia and Punzo (2016).

A member of the class of mixtures of regression models with random
covariates is the cluster-weighted model (CWM; Gershenfeld, 1997). The
CWM assumes a (parametric) functional relation for the local expectation of
Y|X = x, and factorizes the local joint distribution p(x,y) into the product
of the conditional distribution of Y|x and the marginal distribution of
X. Some recent developments in CWMs can be found in Subedi et al. (2013,
2015), Punzo (2014), Ingrassia et al. (2014, 2015), Berta et al. (2016), Punzo
and Ingrassia (2016), Punzo and McNicholas (2017), and Dang et al. (2017).
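
The factorization just described can be written compactly as a finite mixture. The notation below is an assumption introduced for illustration (G components, mixing weights π_g, and parameter vectors ξ_g and ψ_g for the conditional and marginal parts); it is the usual way this class of models is written, not a formula quoted from this contribution.

```latex
p(\mathbf{x}, y; \boldsymbol{\theta})
  = \sum_{g=1}^{G} \pi_g \,
    p(y \mid \mathbf{x}; \boldsymbol{\xi}_g) \,
    p(\mathbf{x}; \boldsymbol{\psi}_g),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1 .
```

Dropping the factor p(x; ψ_g) leads to a mixture of regressions with fixed covariates, while dropping p(y | x; ξ_g) leaves an ordinary mixture of distributions for X.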


2 The proposal 


In this contribution, we introduce the R (R Core Team, 2013) package
flexCWM, available from CRAN at
http://cran.r-project.org/web/packages/flexCWM/index.html, specifically conceived
for fitting CWMs. The package supports modeling of the conditional distribution
of the response variable by means of the most common distributions of the
exponential family and by the t distribution. Covariates may be of mixed type;
supported distributions are multivariate Gaussian, multinomial, binomial, and Poisson.
Following Banfield and Raftery (1993) and Celeux and Govaert (1995),
parsimonious modeling for multivariate normal covariates, based on the
eigenvalue decomposition of the component covariance matrix, is supported (see
Punzo and Ingrassia, 2015). The expectation-maximization (EM) algorithm
is used to obtain maximum-likelihood estimates of the parameters and
several likelihood-based information criteria are adopted to select the number of
groups and/or the parsimonious model. For the local regression coefficients,
standard errors and significance tests are also provided.
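
To make the estimation strategy concrete, the following is a minimal sketch of the EM algorithm for a toy linear Gaussian CWM with a single covariate, written in Python with NumPy. It does not use or reproduce the flexCWM interface: the function name `fit_toy_cwm`, the quantile-based initialization, the synthetic data and the sign convention chosen for the BIC are all assumptions made for illustration.

```python
import numpy as np

def norm_pdf(v, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_toy_cwm(x, y, G=2, n_iter=50):
    """EM for a toy CWM: x ~ N(mu_g, vx_g) and y | x ~ N(b0_g + b1_g * x, vy_g)."""
    n = x.size
    A = np.column_stack([np.ones(n), x])          # regression design matrix
    # Arbitrary initialization: G equal-sized groups along the sorted covariate.
    labels = np.empty(n, dtype=int)
    labels[np.argsort(x)] = np.arange(n) * G // n
    z = np.eye(G)[labels]                         # n x G membership weights
    loglik, betas = [], np.empty((G, 2))
    for _ in range(n_iter):
        # M-step: weighted maximum-likelihood updates, one component at a time.
        pi = z.mean(axis=0)
        dens = np.empty((n, G))
        for g in range(G):
            w = z[:, g]
            mu = np.sum(w * x) / w.sum()
            vx = np.sum(w * (x - mu) ** 2) / w.sum()
            betas[g] = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
            vy = np.sum(w * (y - A @ betas[g]) ** 2) / w.sum()
            dens[:, g] = pi[g] * norm_pdf(y, A @ betas[g], vy) * norm_pdf(x, mu, vx)
        # E-step: posterior probability that each point belongs to each component.
        total = dens.sum(axis=1)
        z = dens / total[:, None]
        loglik.append(np.log(total).sum())
    n_par = (G - 1) + 2 * G + 3 * G               # weights, covariate and regression parameters
    bic = -2.0 * loglik[-1] + n_par * np.log(n)   # smaller is better in this convention
    return {"z": z, "betas": betas, "loglik": np.array(loglik), "bic": bic}

# Synthetic data: two groups that differ in both the covariate and the regression.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(5.0, 1.0, 150)])
y = np.concatenate([1.0 + 2.0 * x[:150], 10.0 - 1.0 * x[150:]]) + rng.normal(0.0, 0.5, 300)
fit = fit_toy_cwm(x, y, G=2)
```

Fitting the same data with different values of G and comparing the resulting `bic` values mimics, in miniature, the use of likelihood-based information criteria to choose the number of groups.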


3 Conclusions 


Several CRAN packages supporting modeling by mixtures of regressions are
available. A list of them may be found in the task view “Cluster Analysis &
Finite Mixture Models” of Leisch and Grün (2012), in the section entitled
“Cluster-wise Regression”. flexmix is one of the most widely used packages
for mixtures of regression models (Leisch, 2004) and mixtures of regression
models with concomitant variables (Grün and Leisch, 2008); it implements
a user-extensible framework for estimation via the EM algorithm. Other
packages for mixtures of regression models include: fpc for mixtures of linear
regression models and fixed point clusters for linear regression (Hennig, 2013),
mixreg for mixtures of one-variable regression models (Turner, 2011), and
mixtools, which provides a set of functions for analyzing a variety of finite
mixture models, including mixtures of regression models with fixed covariates
(see Benaglia et al., 2009, Section 5). Within this context, the flexCWM
package aims to support cluster-weighted modeling, while also providing
an alternative for estimating other classical mixture models.
Abstract Nowadays, in a broad range of application areas, daily data production
has reached unprecedented levels. These data originate from multiple sources, such as
document sources, social media posts, digital pictures and videos, and so on.

The technical and scientific issues related to the data boom have been designated
as the “Big Data” challenges. To deal with big data analysis, innovative algorithms
and data mining tools are needed in order to extract information and discover
knowledge from the continuously growing volume of data.

In most data mining methods, the data volume and variety directly affect the
computational load.

In this paper, we consider a strategic field, e-Government. We illustrate
the strategies and methodologies for big data processing and document
management.

Abstract Today, in a wide range of domains, data production has reached
unprecedented levels. These data come from multiple sources, such as document
sources, social media messages, digital images, videos and so on.

The technical and scientific issues involved in managing large volumes of data have
been designated as “Big Data”. To process large volumes of data profitably,
innovative algorithms and data mining tools are needed
to extract information and discover knowledge from the ever-growing volume of
data.

In most data mining methods, the volume and variety of the data
directly affect the computational load.

In this paper we consider a strategic sector such as e-Government.
We illustrate the strategies and methodologies for processing large
volumes of data and managing documents.


Key words: Big Data, e-Government, Data Mining 


1 Big Data Processing 


In many application areas the daily data production has reached unprecedented
levels. According to recently published statistics, in 2012 about 2.5 EB
(exabytes) of data were created every day, and 90% of the existing data had been
created in the previous two years [1].


These data originate from multiple sources: sensors used to gather climate information,
social networks, digital pictures and video streaming, and so on. Moreover, the size
of these data is growing exponentially thanks to inexpensive devices (smartphones and
sensors) and to the introduction of large Cloud data centres.


The technical and scientific issues related to the data boom have been designated
as the “Big Data” challenges and have been identified as highly strategic by major
research agencies. Most definitions of big data refer to the so-called three Vs:
volume, variety and velocity, referring respectively to the size of data storage, to the
variety of sources and to the frequency of data generation and delivery [2,3].


To deal with big data analysis, innovative approaches for data mining and processing
are required in order to enable process optimization and enhance decision-making
tasks. To achieve this, an increase in computational power is needed and dedicated
hardware can be adopted.


2 The strategic field of e-Government in Italy


E-Government (or e-Gov), that is, the electronic management of public services and
of the processes of democratic governance, concerns the reorganization of bureaucratic
processes in both central and local Public Administrations. In this context, one of the
main goals of e-Gov is to provide robust computerized management of electronic
documents, in order to optimize the work of governmental offices and to offer
users (citizens and businesses) faster and more effective services as well as new ways
of accessing them.


From a general point of view, the theme of e-Government can be traced back to the 
overlap between two worlds that are apparently different and distant from each other; 
in particular, it can be considered as the application of Information and 
Communication Technology (ICT) to problems that are typical both of the Public 
Administration and the legal domain. 


The use of ICT in public administrations is not new: it was introduced some
decades ago through a series of specific projects, often the evolution of
pre-existing legacy applications, conceived to automate single parts of the
information and bureaucratic system and lacking a systemic and global vision [4,5].


Many initiatives, often supported by subsidized funding, were launched in the
eighties within the Community in order to bring ICT deep into public
administrations and to build robust information systems, adaptable to
change and aimed at supporting the principal bureaucratic processes
within specific domains (Ministries, Local Bodies, Regions, etc.).


In the nineties and until the beginning of the present decade, with the spread of the
Internet and related technologies, the focus moved towards opening
such systems to the web, in order to carry out e-Gov initiatives and define a
first level of interconnectivity shared among administrations belonging to
different domains, principally at national level, but also at
international level [6,7].


Nowadays, combining the effectiveness of services with their
transparency within the Public Administration requires strong automation
of the internal processes and, in addition, the capacity to use open systems
that can cooperate at the application level, following federated models. In this way, it is
possible to respect the legal and organizational constraints
established by the autonomy of the various governmental entities while
achieving automated, inter-domain bureaucratic processes.


Note that such technologies are not always directly and easily suited to the
specificities of Italian bureaucratic (e-Government) applications, because of
the constraints imposed by specific regulations.


Generally speaking, the strategic plans underlying all
e-Government actions aim to establish cooperation and coordination among the
different actors of the Public Administration. For more than a decade, the Public
Administration in Italy has been changing its organizational structure to enable
the development of its information systems in line with the new application
requirements, by opening up and reorganizing itself, enacting new regulations,
implementing its own standards and adopting European and international ones, and
resorting to solutions that often amount to real “technological leaps” in the
automation applied.


Why revolutionize a bureaucratic organization that has existed for more than a century
and is based on paper documents and mechanical processes? Why change?


A first, simple answer is the following. Looking at the Italian system, the
need for change is principally due to the strong necessity of de-bureaucratizing
and simplifying processes in order to: i) make public and private
administrative acts transparent; ii) increase the quality of the services
offered; iii) decrease the costs of the organization, thus increasing its efficiency.


From a wider point of view, there is a great need
to equip the national system with suitable instruments able to ensure its growth,
development and competitiveness.


The system of a Nation cannot compete in the international and Community
environment without a modern and suitable bureaucratic system, based on the use of
new technologies, operating on the Internet, and able to guarantee
administrative actions continuity, definite times, quality, security and privacy.


It is necessary to move from systems based on computerized procedures, which are
often centralized and support organizations based on paper documents and
manual processes, to information systems focused on processes, which are
highly automated and based entirely on electronic documents, and can therefore
optimize and rationalize the use of the human resources involved.


The incentives to change are above all the widespread use of electronic
documents and the related dematerialization processes; the implementation, in
full cooperation and interoperability, of inter- and intra-domain processes; the
availability of enabling and low-cost technology; the evolution of
communication networks in terms of both available bandwidth and coverage; the
security of the various levels of the system; and effective systems for access control
and user profiling.


The main instruments achieved so far, though still evolving, are the electronic signature
for the legal validity of documents, time stamping to provide temporal evidence,
the digital protocol, long-term preservation of electronic documents according to the
regulations, and certified electronic mail to provide evidence of the sending
and receipt of documents.


In Italy, CNIPA has defined a reference model for interoperability and
application cooperation in the Public Administration, named “Architecture of the
Public System of Connectivity and Cooperation (PSC)”. The Public System of
Cooperation (PSCoop) is a set of technological standards and infrastructural services
whose objective is to enable the interoperability and cooperation of
information systems for the fulfillment of administrative actions. The services offered
aim to create a foundation to which all the Regions can connect in order to use
and distribute services through standard protocols, with shared security and access
rules and with an agreed and monitored quality of service.


Many Regions and Local Bodies have been equipping themselves to take advantage
of the services offered, and many initiatives promoted by the Ministry of Innovation
are leading to the sharing of the models and solutions adopted, in order to achieve
real interoperability in the short term.


3 Documents and Data Processing 


Note that all the e-Gov applications described so far have dematerialization activities
as a common and fundamental factor: information, previously stored as graphic
marks on material (paper) supports, is made immaterial through a codified electronic
representation, and can be stored on several digital supports such as
memories, magnetic or optical disks, tapes or other mature technologies currently in
use.

Dematerialization is not only a normative and technological challenge but also an 
organizational matter involving various human resources. The transformation of a 
bureaucratic organization based on paper into one based on electronic documents is 
not easily achievable according to general models that are exportable among the 
organizations themselves. 

So far, we have described the main characteristics of the e-Gov system. In particular,
we note that e-Gov processes usually involve a huge quantity of paper
documents that need to be properly managed, stored and distributed. In order to
reduce the amount of paper and to optimize the communication of information in
terms of time and resources consumed, it is widely agreed that a semantic-based
dematerialization process would greatly enhance e-Government systems and
application procedures.

The dematerialization process implies the application of syntactic-semantic
methodologies in order to automatically transform unstructured or
semi-structured documents into formally structured, machine-readable records.
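
As a purely illustrative sketch of the syntactic side of such a transformation, the fragment below maps a short semi-structured text onto a structured record by means of regular expressions. The field names, the patterns and the sample document are hypothetical; a real dematerialization pipeline would also need the semantic analysis mentioned above, which is not covered here.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    """Structured, machine-readable view of a document (hypothetical fields)."""
    protocol_number: Optional[str]
    date: Optional[str]
    subject: Optional[str]

# One pattern per field; each captures the value in its first group.
PATTERNS = {
    "protocol_number": re.compile(r"Protocol\s+no\.\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "subject": re.compile(r"Subject:\s*(.+)", re.IGNORECASE),
}

def to_record(text):
    """Extract each field if its pattern is found; leave it as None otherwise."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return Record(**fields)

doc = """Protocol no. 2017/0042
Date: 2017-06-28
Subject: Request for building permit"""
record = to_record(doc)
```

Fields that cannot be located are left empty rather than guessed, so that incomplete records can be routed to manual processing.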

The core aspect of a novel and efficient dematerialization process is the idea
behind the common concept of document, which can be defined as the
representation of acts, facts and figures, made directly or by means of electronic
processing, and stored on an intelligible support. In other words, a document consists
of objects such as text, images, drawings, structured data, operational codes,
programs and movies that, according to their relative position on the support,
determine the shape and, consequently, the structure of the document itself through
the relationships between them. During the various e-Government
processing phases, which differ considerably from one application domain to another, a
document is processed and eventually stored on various kinds of media
suitable for archiving and preservation: paper, photographic films and microfilms,
VHS cassettes, magnetic tapes, DVD disks, and more.
Happy parents’ tweets
Twitter, parents and happiness


Letizia Mencarini, Viviana Patti, Mirko Lai, and Emilio Sulis 


Abstract This article explores opinions and semantic orientation around fertility and 
parenthood by scrutinizing filtered Italian Twitter data. We propose a novel 
methodological framework relying on Natural Language Processing techniques for 
text analysis and social media corpora development, which is aimed at extracting 
sentiments from texts. A multi-layered manual annotation for exploring sentiment 
and attitudes to fertility and parenthood was applied to Twitter data. The corpus was 
analysed through sentiment and emotion lexicons in order to highlight how affective 
language is used in this domain. It emerges that parents express a generally positive 
attitude towards children, while children are more critical towards parents. The 
corpus constitutes a first step to improve our understanding of attitudes towards 
fertility and parenthood in this kind of content.

Abstract The article explores opinions and semantic orientation around
fertility and parenthood, starting from an analysis of Italian Twitter data.
A new methodological framework is proposed, based on Natural
Language Processing techniques for text analysis and the development of linguistic
corpora from social media, aimed at extracting sentiments from texts. A multi-layered
manual annotation was applied to the Twitter data to explore users’ sentiment and
attitudes towards fertility and parenthood. The
corpus was analysed using emotion and sentiment lexical resources, in order to
highlight how affective language is used in this domain.
The analysis shows that parents express a generally positive attitude
towards their children, while children are more critical. The corpus is a
first step towards understanding the attitudes to fertility and
parenthood expressed spontaneously in this kind of text.
linguistic corpora


¹ Letizia Mencarini, Dondena Centre for Research on Social Dynamics and Public Policy &
Dept. of Management and Technology, Bocconi University, Italy; email: letizia.mencarini@unibocconi.it.
Viviana Patti, Mirko Lai, Emilio Sulis, Dipartimento di Informatica, University of Turin, Italy;
email: {patti,lai,sulis}@di.unito.it.




Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 


Introduction 


The proliferation of sensors, together with the increasing popularity of social media,
leaves digital traces. This massive dissemination of information heralds a new era in
social studies, bringing about new research challenges and opportunities (King, 2011;
Lazer et al., 2009; Aggarwal, 2013). Several studies have exploited online social
media (e.g., Facebook, Instagram, Twitter). In particular, Twitter analysis has been
used to distinguish cultural traits (Golder and Macy, 2011), as well as a
multitude of other aspects, ranging from political polarization (Conover et al., 2011) and
polls (O’Connor et al., 2010) to finance (Bollen et al., 2011). Tweets have also
proven useful in the analysis of sentiment (Pang and Lee, 2008), as well as in
distinguishing emotions (Mohammad et al., 2013) or different kinds of irony (Sulis
et al., 2016; Hernandez-Farias et al., 2016). These kinds of digital traces have
already been used to study human behaviour. For example, web searches have been
used to predict the spread of infectious diseases (Ginsberg et al., 2010); email has
been used to track migration (Zagheni and Weber, 2012), and mobile phones to study
daily life patterns (Gonzalez et al., 2008), as well as economic development
(Eagle et al., 2010). We, instead, focus here on the nexus between fertility and
subjective wellbeing (SWB) by using filtered Twitter data in Italian. In particular,
we investigate opinions and semantic orientation around fertility and parenthood.

There has been a recent increase in studies on subjective wellbeing and 
fertility (Clark et al. 2008; Kohler et al. 2005; Myrskylä & Margolis 2014). While 
these studies provide important information on the dynamics that link subjective 
wellbeing and childbearing and childrearing, they can only provide limited insights 
into the substantive role SWB plays in terms of individual fertility behaviour. 
Therefore, it can be difficult to explain fertility change without greater insight into 
the nature of SWB, and how it is discussed in relation to fertility. In this context, we 
want to understand whether social media content, and in particular Twitter data, can 
be exploited for investigating the opinions and semantic orientation around fertility 
and parenthood. This approach may provide new insights into the SWB-fertility 
nexus. 

Using Twitter data, SWB can be read indirectly. In particular, we propose a 
novel methodological framework relying on Natural Language Processing (NLP) 
techniques for text analysis and social media corpora development, which is aimed 
at extracting sentiments or moods, which in turn can be used to construct indirect 
SWB measures. This is, of course, different from survey questionnaires, where
respondents typically report their wellbeing on a grading scale, and where a skewed
distribution is the norm, with few people reporting very low levels of SWB. On
Twitter, individuals’ opinions are posted spontaneously and often as a reaction to
some emotionally-driven observation. Moreover, using Twitter we can incorporate
additional measures of attitudes towards children and parenthood into our analysis.
This offers wider geographical coverage than is found in normal survey information.
As a reference dataset, we adopted all the tweets posted in Italian in 2014 from the 
TWITA collection (Basile and Nissim, 2013). A multi-step methodology was 
established in order to filter and select the relevant tweets concerning fertility and 
parenthood. Then, in order to enable a deeper and more fine-grained analysis of
sentiment-related phenomena for fertility and parenthood, a multi-layered manual 
annotation was applied to a random sample of the selected data. Here sentiment and 
irony on parenthood-related topics were annotated. One of the novelties of the 
semantic annotation scheme we created is that it allowed us to mark up information 
not only for sentiment polarity, but also for the specific semantic areas/sub-topics 
that may be the target of sentiment in the analysis of the link between SWB, 
parenthood, and fertility. This is a necessary first step in enabling further analysis of 
this kind of content. 

The corpus was also analysed with sentiment and emotion lexicons in order 
to highlight relationships between the use of affective language and specific sub- 
topics. This analysis is useful per se, but it is also functional in addressing the 
automatic sentiment classification task. The annotated corpus is available to the 
research community. Its development constitutes only a first step and is a 
precondition for further analysis. Further analysis would involve extracting from the 
corpus, which includes semantically enriched data, measures of SWB constructed in 
an indirect way, which might improve our understanding of attitudes to fertility and 
parenthood. 


TW-SWELLFER: Dataset and Annotation Methodology 


As a reference dataset, we adopted all the tweets posted in Italian in 2014,
which were retrieved through the Twitter Streaming API by applying the Italian
filter proposed within the TWITA project (Basile and Nissim, 2013). The dataset
includes 259,893,081 tweets (4,766,342 geotagged). We applied a multi-step
methodology in order to filter and select the tweets relevant to fertility
and parenthood. We could not rely on one or a few hashtags or
other elements that identify posts on fertility and parenthood. In fact, these
topics are scattered across the dataset, and messages may contain relevant
information on such subjects even if the main topic of the post is different. We are
facing a situation where, on the one hand, the set of data potentially
relevant for our specific analysis is wider than usual; on the other hand, it is more
difficult to identify the presence of information related to the topics we are
interested in. In a first step, eleven hashtags¹ and nineteen other keywords were
chosen for selecting tweets of interest. This list is the result of a combination of
manual content analysis and a linguistic analysis of synonyms. We obtained a total
of 3.9 million tweets. A second filtering step consisted in removing noisy
tweets from the corpus. Tweets posted by the accounts of companies, institutions and
newspapers were deleted, since these messages do not concern individual expression.
Finally, duplicated tweets not marked as RT were deleted.
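
The filtering steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual pipeline: the `Tweet` structure, the way institutional accounts are supplied (a plain set of user names) and the single example keyword are hypothetical, since the text does not list the nineteen keywords or say how such accounts were identified; only the eleven hashtags come from the footnote.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tweet:
    user: str
    text: str
    is_retweet: bool = False

# Hashtags listed in the footnote; the keyword set is a placeholder.
HASHTAGS = {"#papa", "#mamma", "#babbo", "#incinta", "#primofiglio", "#secondofiglio",
            "#futuremamme", "#maternita", "#paternita", "#allattamento", "#gravidanza"}
KEYWORDS = {"genitori"}  # hypothetical example keyword

def is_candidate(tweet):
    """Step 1: keep tweets containing at least one hashtag or keyword of interest."""
    tokens = {tok.strip(".,;:!?").lower() for tok in tweet.text.split()}
    return bool(tokens & (HASHTAGS | KEYWORDS))

def filter_tweets(tweets, institutional_accounts):
    """Apply keyword selection, account filtering and duplicate removal in order."""
    kept, seen_texts = [], set()
    for tw in tweets:
        if not is_candidate(tw):
            continue
        if tw.user in institutional_accounts:    # step 2: drop non-individual accounts
            continue
        if not tw.is_retweet:                    # step 3: drop duplicates not marked as RT
            if tw.text in seen_texts:
                continue
            seen_texts.add(tw.text)
        kept.append(tw)
    return kept

tweets = [
    Tweet("anna", "Oggi #mamma felice!"),
    Tweet("luca", "Oggi #mamma felice!"),
    Tweet("giornale", "Nuova legge sulla #maternita"),
    Tweet("marco", "RT Oggi #mamma felice!", is_retweet=True),
    Tweet("sara", "Che bella giornata"),
]
kept = filter_tweets(tweets, institutional_accounts={"giornale"})
```

In this toy run only the first tweet and the explicit retweet survive: the second is an unmarked duplicate, the third comes from an institutional account and the last contains none of the selected terms.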


¹ #papa, #mamma, #babbo, #incinta, #primofiglio, #secondofiglio, #futuremamme, #maternita, #paternita,
#allattamento, #gravidanza.
1.1 Annotation scheme and annotation process


We developed and applied to our dataset an annotation model aimed at studying two
aspects: the polarity of the sentiment expressed in the tweets, and the specific
parenthood-related topics discussed on Twitter that are the target of the sentiment.
Sentiment polarity. To build our annotation model, we relied on a standard
annotation scheme for sentiment polarity (POLARITY), adopting the same
labels POS, NEG, NONE and MIXED provided by the organizers of the shared task on
sentiment analysis in Twitter for Italian (Basile et al., 2014). The
presence or absence of irony was also marked, in order to be able to reason about
sentiment polarity when figurative devices are used. To mark irony,
we introduced two polarized ironic labels: HUMNEG, for ironic tweets with
negative polarity, and HUMPOS, for ironic tweets with positive polarity.
Parenthood-related semantic areas. A set of labels marks the specific
semantic areas (or SUBTOPICS) of the tweets related to the parenthood domain.
This part of the annotation scheme is very important, since it provides us
with a semantic grid for analysing which aspects of parenthood are
discussed on Twitter. We considered 7 labels, suggested by a group of three experts in
the subjective well-being and fertility domain after a manual analysis of a subset of
the tweets: TOBEPA - Being parents (to mark when the user comments generically
on his or her status as a parent); TOBESO - Being sons/daughters (to mark when the
user is a son/daughter commenting on the parent-child relationship);
DAILYLIFE - Daily life (to mark messages commenting on recurring situations in
the everyday relationship between parents and children); JUDGOTHERPA -
Judgment of other parents’ behaviour (to mark comments on the upbringing of
children, e.g., comments on behaviours that do not seem appropriate to
the parental role); FUTURE - Children’s future (to mark tweets where parents
express sentiments about their children’s future); BECOMPA - Becoming parents
(to mark tweets where users speak about the prospect or fear of becoming parents); POL -
Political side (to mark tweets about laws that have an impact on being parents).
Two additional tags (IN-TOPIC/OFF-TOPIC) were added to allow
annotators to mark whether the tweet is relevant. These tags were necessary
because of the noise still present in the dataset. Furthermore, the manual annotation
will also produce data that can be used to train a supervised topic classifier for
the whole TW-SWELLFER corpus.
A random sample of 5,566 tweets was drawn from TW-SWELLFER. This sample was
annotated manually through crowdsourcing on the CrowdFlower platform¹. We
relied on CrowdFlower controls to exclude unreliable annotators and spammers
by means of hidden tests, built from a set of gold-standard test questions
equipped with gold reasons. The annotators' task was, first, to mark whether the
post is IN-TOPIC or OFF-TOPIC (or unintelligible) and then, for IN-TOPIC posts,
to mark the polarity and the presence of irony on the one hand, and the subtopics
on the other. Precise guidelines were provided to the annotators.


¹ https://www.crowdflower.com/
Overall, at least three independent annotations were collected for each tweet.
We used majority voting to select the true label. We obtained the following results.
In-topic vs off-topic. Manual annotation on this aspect resulted in 2,355 in-topic
tweets (42.3%) and 3,136 off-topic tweets (56.3%); the remaining 75 tweets were
discarded (cases of disagreement). Thanks to the preliminary filtering steps, the
proportion of in-topic tweets is fairly high compared with common results from
other Twitter-based content and opinion analyses (Ceron et al., 2014).
Polarity, irony, sub-topics (in-topic tweets). We obtained 1,545 tweets labeled
with the same tags for all the layers (POLARITY, IRONY and SUBTOPICS). We
call this set the TW-SWELLFER-GOLD corpus.
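The majority-voting step described above can be sketched as follows. This is only an illustration: the strict-majority rule, the function name `majority_label` and the toy annotations are assumptions, since the text does not specify how ties were resolved beyond discarding cases of disagreement.

```python
from collections import Counter

def majority_label(annotations):
    """Return the label chosen by a strict majority of annotators, else None."""
    if not annotations:
        return None
    top_label, top_count = Counter(annotations).most_common(1)[0]
    # A strict majority needs more than half of the votes (assumed rule)
    if top_count * 2 > len(annotations):
        return top_label
    return None

# Toy annotations, invented for illustration (not taken from the corpus)
votes = {
    "tweet_a": ["IN-TOPIC", "IN-TOPIC", "OFF-TOPIC"],
    "tweet_b": ["OFF-TOPIC", "OFF-TOPIC", "OFF-TOPIC"],
    "tweet_c": ["IN-TOPIC", "OFF-TOPIC", "UNINTELLIGIBLE"],
}
resolved = {tweet: majority_label(labels) for tweet, labels in votes.items()}
# Tweets without a majority label are treated as cases of disagreement
discarded = [tweet for tweet, label in resolved.items() if label is None]
```

With three annotations per tweet, a label is kept only when at least two annotators agree; the third toy tweet has no majority and would be discarded.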


Analysis 


Regarding IN-TOPIC tweets (2,355 posts), 26.4% were labeled as positive and
22.3% as negative, which gives an indication of the general feeling on Twitter
about the research topics of happiness and parenthood. Irony is limited to 15.7%
of the messages, and negative irony prevails (10.1% negative ironic tweets versus
5.6% positive ironic tweets), while neutral tweets are just 8.3%. Mixed tweets
account for only 1.2% (the remaining 26% are labelled as NULL because annotators
did not agree on the polarity, irony and subtopic labels). Overall, positive and
negative feelings towards family, parenthood and fertility appear to be more or
less equally spread across Italian Twitter. Even if positive posts slightly
outnumber negative ones, ironic tweets must also be considered: most of them are
negative ironic posts (i.e., insulting or damaging the target), which balances
the slight difference between purely positive and purely negative tweets.
Furthermore, this particular topic, combined with the nature of communication on
Twitter via short direct messages, discourages people from standing in the grey
(neutral) area, as may happen in other cases: about 90% of the tweets show an
explicit polarity, meaning that people take a side and express their opinions.

Which opinions are these, and what are they about? Going further with the
analysis and looking also at the content, that is, at the “topic specification”
attribute and its values (Fig. 1), the largest category is that of sons' and
daughters' tweets (TOBESO, 40.3%), in which children post about being children
and/or about their relationship with their parents. The parents tag (TOBEPA)
accounts for 15% and the becoming-parents tag (BECOMEPA) for 10%. The remaining
categories have a minor impact, each roughly between 1% and 6% (e.g.,
JUDGOTHERPA: 6.5%; DAILYLIFE: 5.6%).


1.2 Sentiment and emotion analysis 


We performed a lexical analysis on the annotated corpus which concerns different 
aspects of affect: sentiment and emotions. As we will see, the distribution of terms 
in each group of messages reveals interesting patterns. 
The overall polarity of the messages was computed by exploiting four
existing sentiment lexical resources (Nissim and Patti, 2016) and summing positive
and negative terms. The polarity value was then normalized by dividing it by the
number of terms in each group. In particular, the four lexica considered
(LIWC, HuLiu, EmoLex and AFINN¹) count more positive terms in positive messages.
Similarly, negative terms are more frequent in negative messages. Ironic messages
reveal a similar pattern, although smoothed. Table 1 presents some results.


Table 1: Polarity values according to different lexicons in tweets tagged with different labels.


Tag           polarityLIWC  polarityHuLiu  polarityEmoLex  polarityAfinn
POS                  1.062          0.220           0.621          3.512
NEG                 -1.609          0.037           0.122          0.390
HUMPOS               0.194          0.122           0.225          2.293
HUMNEG              -0.336          0.078           0.637          0.610
BECOMEPA             1.502          0.732           0.182         -1.643
TOBESO               1.969          0.876           0.018          1.561
FUTURE               0.931          0.079           0.174         -2.058
TOBEPA               1.939          1.379           0.178          5.036
JUDGOTHERPA          1.883          0.896           0.118         -1.110
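A minimal sketch of the normalized lexicon polarity behind Table 1 is given below. It rests on several assumptions: tokens are obtained by whitespace splitting, the lexicon maps a term to a signed score, and "number of terms" is read as the total number of tokens in the group. The toy lexicon and messages are invented and are not entries of LIWC, HuLiu, EmoLex or AFINN.

```python
def group_polarity(messages, lexicon):
    """Sum the lexicon scores of all tokens and divide by the token count."""
    total_score = 0.0
    total_terms = 0
    for message in messages:
        tokens = message.lower().split()  # naive tokenization (assumption)
        total_terms += len(tokens)
        # Terms missing from the lexicon contribute a score of zero
        total_score += sum(lexicon.get(token, 0.0) for token in tokens)
    return total_score / total_terms if total_terms else 0.0

# Hypothetical toy lexicon: +1 for positive terms, -1 for negative terms
toy_lexicon = {"felice": 1.0, "amore": 1.0, "paura": -1.0, "triste": -1.0}
positive_group = ["sono felice di essere mamma", "amore di papà"]
negative_group = ["ho paura del futuro", "che giornata triste"]

positive_score = group_polarity(positive_group, toy_lexicon)
negative_score = group_polarity(negative_group, toy_lexicon)
```

Dividing by the group size makes groups with very different numbers of tweets comparable, which is the stated purpose of the normalization.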


The emotion lexicon also indicates a larger frequency of terms related to anger,
sadness, fear and disgust in negative messages than in positive ones (Fig. 2, left).
Positive messages, instead, contain more terms related to joy, anticipation and
surprise. Some suggestions can be derived from the comparison between the polarity
categories and the corresponding ironic ones. For instance, terms related to joy
are more frequent in ironic negative messages than in negative ones. This hints
at the polarity reversal phenomenon, where a shift is produced by adopting a
seemingly positive statement to convey a negative one (Sulis et al., 2016).


Figure 1: Distribution of emotions by polarity (left) and sub-topics (right). 
The analysis of sub-topic specifications reveals a positive polarity for messages
concerning TOBEPA, while BECOMEPA has a more negative polarity (Table 1).
Focusing on the emotion lexicon, TOBEPA has a higher incidence of joy words
(Fig. 2, right). Messages concerning the education of children (JUDGOTHERPA)
contain a high frequency of anger and disgust terms. The category TOBESO is more
controversial: it has the highest frequency of negative terms such as fear, but
also of trust terms, as well as the lowest frequency of joy terms. Consistently,
anticipation is more frequent in the BECOMEPA group of messages. Overall, it
seems that daughters and sons are more critical towards their parents, whereas
parents seem to express a more positive attitude towards their daughters and sons.


¹ LIWC (http://liwc.wpengine.com/); Hu&Liu (Hu and Liu, 2004); AFINN (Nielsen, 2011); EmoLex
(Mohammad et al., 2013).


Conclusions 


The contribution of this paper is the exploration of opinions and semantic 
orientations related to fertility and parenthood as found in about three million Italian 
tweets. To this end, we developed a Twitter corpus of social media contents. This 
corpus was, then, annotated with a novel semantic annotation scheme not only for 
sentiment polarity, but also for the specific semantic areas/sub-topics which were the 
target of sentiment in the fertility-SWB domain. The corpus was further analysed by 
using sentiment and emotion lexicons in order to highlight the relationships between 
the use of affective language and specific sub-topics in the fertility-SWB domain. 

In addition, this work brings Italy into the debate on the nexus between
subjective wellbeing and fertility. Italy, in fact, has been excluded from ongoing
research on the topic because of a lack of suitable longitudinal data (Frey and
Stutzer 2000; Kohler et al. 2005; Clark et al. 2008; Myrskylä and Margolis 2014).
More must be done to enable a fruitful exploitation of these data for
demographic purposes. It would be particularly important to extract information
about the educational and socio-demographic traits of the users in the dataset.
Investigating the relationship between social media data and official statistics
is also a promising direction. By using the geocodes associated with tweets,
research can link major signals (positive and negative) stemming from the
sentiment analysis of the resident population in a given area (Italian provinces
or NUTS-3 level) with the socio-economic characteristics of that area and the
presence of childcare services. In addition, further investigations might exploit
the information about the specific semantic areas considered in the present study.
By aggregating geo-referenced messages into administrative areas, other
interesting correlations can be detected. This analysis might shed light on the
use of social media content in predicting demographic variables.
Space-Time Analysis of Movements in
Basketball using Sensor Data 


Un’analisi dei movimenti spazio-temporali nella 
Pallacanestro con l’utilizzo di dati provenienti da sensori 


Rodolfo Metulini and Marica Manisera and Paola Zuccolotto 


Abstract Global Positioning Systems (GPS) are nowadays intensively used in Sport
Science, as they make it possible to capture the space-time trajectories of players,
with the aim of inferring information that is useful to coaches in addition to
traditional statistics. In our application to basketball, we used Cluster Analysis
to split the match into a number of separate time periods, each identifying
homogeneous spatial relations among players on the court. The results allowed us
to identify differences in spacing among players, to distinguish defensive from
offensive actions, and to analyze the transition probabilities from one group to
another.

Abstract (translated from the Italian) Global positioning systems (GPS) are widely
used in sport because they allow us to record the positions of the players on the
court at different time instants, with the aim of providing useful information in
addition to traditional statistics. In an application to basketball, we use
Cluster Analysis to split the match into groups that are homogeneous in terms of
spatial relations among players. We also identify whether each group corresponds
to offensive or defensive actions, and we estimate the transition matrices that
quantify the probability of moving from one group to another.
1 Introduction


Studying the interaction between players on the court, in relation to team
performance, is one of the most important issues in Sport Science. In recent years,
thanks to the advent of Information Technology Systems (ITS), it has become possible
to collect a large amount of spatio-temporal data of different types, which are,
basically, of two kinds. On the one hand, play-by-play data report a sequence of
relevant events that occur during a match. Events can be broadly categorized as
player events, such as passes and shots, and technical events, such as fouls and
time-outs. Carpita et al. [1, 2] used cluster analysis and principal component
analysis to identify the drivers that affect the probability of winning a football
match. Social network analysis has also been used to capture the interactions
between players [3]; Passos et al. [4] used centrality measures with the aim of
identifying central players in water polo.

On the other hand, object trajectories capture the movement of players or the ball.
Trajectories are captured using optical- or device-tracking and processing systems.
Optical systems use cameras; the images are then processed to compute the
trajectories [5], which are commercially supplied to professional teams or leagues
[6, 7]. Device systems rely on devices that infer location based on Global
Positioning Systems (GPS) and are attached to the players’ clothing [8]. The
adoption of this technology and the availability of data are driven by various
factors, particularly commercial and technical ones. Even once trajectory data
become available, explaining movement patterns remains a complex task, as the
trajectory of a single player depends on a large number of factors. The trajectory
of a player depends on the trajectories of all the other players on the court, both
teammates and opponents. Because of these interdependencies, a player’s action
causes a reaction. A promising niche of the Sport Science literature, borrowing
from the concept of Physical Psychology [9], describes players on the court as
agents who face external factors [10, 11]. In addition, there are typically certain
role definitions in a sports team that influence movement. Predefined plays are
used in many team sports to achieve specific objectives; moreover, teammates who
are familiar with each other’s playing style may develop productive interactions
that are used repeatedly. Experts want to explain why, when and how specific
movement behavior is expressed because of tactical behavior, and to retrieve
explanations of observed cooperative movement patterns. A common way to deal with
this complexity in team sport analysis consists in segmenting a match into phases,
as this facilitates the retrieval of significant moments of the game. For example,
Perin et al. [12] developed a system for the visual exploration of phases in
football.

The aim of this paper is to study the spatial pattern of the players on the court
and to contribute, with our results, to the literature on data-mining methods for
trajectory analysis in team sports, with the final objective of suggesting new
useful strategies to improve the team’s performance. Using a basketball case study,
and having available the spatio-temporal trajectories extracted from GPS tracking
systems, we applied a cluster analysis in order to identify different game phases,
which allows us to characterize the spatial pattern of the players on the court.
Each cluster defines a game phase, because it groups all the moments that are
homogeneous in terms of spacing among players. First, we characterize each cluster
in terms of the players’ positions on the court. Then, we determine whether each
cluster corresponds to defensive or offensive actions and compute the transition
matrices in order to examine the probability of switching to another group from
time t to time t + 1.


2 Data and Methods 


Basketball is a sport generally played by two teams of five players each on a
rectangular court (28 m × 15 m). The match, according to International Basketball
Federation (FIBA) rules, lasts 40 minutes and is divided into four periods of 10
minutes each. The objective is to shoot a ball through a hoop 46 cm in diameter,
mounted at a height of 3.05 m on a backboard at each end of the court.

The data we use in the analyses that follow refer to a friendly match played on
March 22nd, 2016 by a team based in the city of Pavia (Italy). This team played
the 2015-2016 season in the C-gold league, the fourth league in Italy. In total,
six players took part in the friendly match. All of them wore a microchip in their
clothing. The microchip collects the position (in pixels of 1 m²) on both the
x-axis and the y-axis, as well as on the z-axis (i.e., how high the player jumps).
The positions of the players were detected at millisecond level. Considering all
six players, the system recorded a total of 133,662 space-time observations ordered
in time. On average, the system collects positions about 37 times every second.
Considering that six players are on the court at the same time, the position of
each single player is collected, on average, every 162 milliseconds (6/37 s ≈
0.162 s).

The x-axis (length) and y-axis (width) coordinates were filtered with a Kalman
approach. Kalman filtering is an algorithm that predicts the future state of a
system based on the previous ones, in order to produce more precise estimates. We
cleaned the dataset by dropping the pre-match, half-time break and post-match
periods. We completed the dataset by replacing all the missing coordinates, i.e.
those of the milliseconds that were not detected, with the coordinates of the
closest previous instant with non-missing values. We then reshaped the dataset to
obtain a data matrix with rows uniquely identified by the millisecond and columns
devoted to the players’ variables. The final dataset has 3,485,147 rows.

We applied a k-means Cluster Analysis. Cluster analysis is a method of grouping a
set of objects in such a way that the objects in the same group (cluster) are more
similar to each other than to those in other groups. In our case, the objects are
the time instants, expressed in milliseconds, while similarity is expressed in
terms of the distances among players¹. Based on the value of the between deviance
(BD) / total deviance (TD) ratio and on the increments of this value when the
number of clusters increases by one, we chose k = 8 (BD/TD = 50% and relatively low
increments for k > 8).


¹ In the analyses that follow, for the sake of simplicity, we only consider the period in which
player 3 was on the bench.
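Three of the steps described in this section can be sketched in a few lines: carrying the last observed coordinate forward, turning player positions into pairwise distances, and computing the BD/TD ratio for a given partition. This is an illustrative sketch only; the function names and toy values are hypothetical, the cluster labels are assumed to come from a separate k-means run, and the sketch does not reproduce the authors' pipeline.

```python
import math
from itertools import combinations

def fill_forward(values):
    """Replace each None with the most recent non-missing value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)  # leading gaps stay None: nothing to carry forward
    return filled

def pairwise_distances(positions):
    """Euclidean distances between every pair of players at one instant."""
    return [math.dist(p, q) for p, q in combinations(positions, 2)]

def bd_td_ratio(rows, labels):
    """Between-cluster deviance divided by total deviance."""
    n_cols = len(rows[0])
    grand = [sum(r[c] for r in rows) / len(rows) for c in range(n_cols)]
    total = sum((r[c] - grand[c]) ** 2 for r in rows for c in range(n_cols))
    between = 0.0
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centre = [sum(r[c] for r in members) / len(members) for c in range(n_cols)]
        between += len(members) * sum(
            (centre[c] - grand[c]) ** 2 for c in range(n_cols)
        )
    return between / total

# Toy inputs, invented for illustration
x_track = fill_forward([1.0, None, None, 2.5, None])
distances = pairwise_distances([(0, 0), (3, 4), (0, 4)])
rows = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.2]]  # distance features
labels = [0, 0, 1, 1]                                   # hypothetical partition
ratio = bd_td_ratio(rows, labels)
```

In practice one would compute the ratio for a range of k and stop where the gain from adding one more cluster becomes small, which is the criterion stated above for choosing k = 8.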
3 Results


The first cluster (C1) contains 13.56% of the observations (i.e., 13.56% of the
total game time). The other clusters, named C2, ..., C8, contain 4.59%, 14.96%,
3.52%, 5.63%, 35.33%, 5.00% and 17.41% of the total sample, respectively.
We used Multidimensional Scaling (MDS) to plot the differences between the groups
in terms of positioning on the court. With the MDS algorithm we aim to place each
player in an N-dimensional space such that the average between-player distances
are preserved as well as possible. Each player is then assigned coordinates in each
of the N dimensions. We chose N = 2 and drew the related scatterplots, shown in
Figure 1. We observe strong differences in the positioning pattern among groups. In
C1 and C5 players are equally spaced along the court. C6 also shows an equally
spaced structure, but the five players are closer together. In other clusters we
can see a spatial concentration: for example, in C2 players 1, 5 and 6 are close
together, while in C8 this is the case for players 1, 2 and 6. Figure 2 reports
cluster profile plots and helps us to better interpret the spacing structure in
Figure 1, characterizing groups in terms of average distances among players. The
profile plot for C6 confirms that players are closer together: all the distances
are smaller than the average distance. In the same way, C2 presents distances among
players 1, 5 and 6 that are smaller than the average. After defining whether each
moment corresponds to an offensive or a defensive action by looking at the average
coordinate of the five players on the court, we also found that some clusters
represent offensive rather than defensive actions. More precisely, clusters C1, C2,
C3 and C4 mainly correspond to offensive actions (85.88%, 85.91%, 73.93% and 84.62%
of the time in each cluster, respectively), and C6 strongly corresponds to
defensive actions (85.07%). Figure 3 shows the transition matrix, which reports the
relative frequency with which subsequent moments in time report a switch from one
cluster to a different one. It emerges that 31.54% of the times C1 switches to a
new cluster, it switches to C3, another offensive cluster. C2 switches to C3
42.86% of the times. When the defensive cluster (C6) switches to a new cluster, it
switches to C8 56.25% of the times.


Fig. 1 Map representing, for each of the 8 clusters, the average position in the x-y axes of the
five players, using MDS.


Fig. 2 Profile plots representing, for each of the 8 clusters, the average distance among each pair
of players.


(columns: cluster at time t; rows: cluster at time t + 1; values in %)

           1      2      3      4      5      6      7      8
1          0  10.71  23.53  47.83      0  20.83  31.25  20.23
2       0.77      0   9.15      0   1.85   2.08   8.33   2.89
3      31.54  42.86      0   8.70  44.44  20.83  18.75  20.23
4       6.15   3.57   1.96      0   7.41      0  10.42   1.16
5       0.77   3.57  16.99  17.39      0      0  16.67   8.09
6      27.69   7.14  18.95      0   1.85      0      0  43.93
7      15.38  21.43   3.92   4.35  18.52      0      0   3.47
8      17.69  10.71  25.49  21.74  25.93  56.25  14.58      0


Fig. 3 Transition matrix reporting the relative frequency with which subsequent moments (t, t + 1)
report a switch from a group to a different one.
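A switch-frequency matrix of the kind shown in Fig. 3 can be computed from the time-ordered sequence of cluster labels. The sketch below is a minimal illustration under two assumptions: only changes of cluster between consecutive instants are counted, and frequencies are normalized per origin cluster. The function name and the toy sequence are hypothetical.

```python
def switch_frequencies(labels, clusters):
    """Percentage of switches from each origin cluster to each destination."""
    counts = {a: {b: 0 for b in clusters} for a in clusters}
    for current, following in zip(labels, labels[1:]):
        if current != following:  # staying in the same cluster is not a switch
            counts[current][following] += 1
    freqs = {}
    for origin in clusters:
        n_switches = sum(counts[origin].values())
        freqs[origin] = {
            dest: (100.0 * counts[origin][dest] / n_switches if n_switches else 0.0)
            for dest in clusters
        }
    return freqs

# Toy sequence of cluster labels over time, invented for illustration
toy_sequence = [1, 1, 3, 3, 1, 6, 6, 8, 1, 3]
freqs = switch_frequencies(toy_sequence, clusters=[1, 3, 6, 8])
```

Here `freqs[origin][dest]` is indexed by origin first, so each origin's frequencies sum to 100, just as each column of the matrix in Fig. 3 does.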


4 Conclusions and future research 


In recent years, the availability of “big data” in Sport Science has increased the
possibility of extracting insights from games that are useful for coaches, who are
interested in improving their team’s performance. In particular, with the advent of
Information Technology Systems, the availability of players’ trajectories makes it
possible to analyze space-time patterns with a variety of approaches: Metulini
[13], for example, adopted motion charts as a visual tool to facilitate the
interpretation of results. Among the existing variety of methods, in this paper we
used a cluster analysis approach based on trajectory data in order to identify
specific patterns of movement. We segmented the game into phases of play and
characterized each phase in terms of the spacing structure among players, their
relative distances, and whether it represents an offensive or a defensive action,
finding substantial differences among phases. These results shed light on the
potential of data-mining methods for trajectory analysis in team sports. In future
research we aim to i) extend the analysis to multiple matches and ii) match the
play-by-play data with the trajectories in order to extract insights on the
relationship between particular spatial patterns and the team’s performance.
An ordinal Latent Markov model for the
evaluation of health care services 


Un modello Latent Markov ordinale per la valutazione di 
servizi assistenziali 


Montanari Giorgio E., Doretti Marco and Bartolucci Francesco 


Abstract This work studies the dynamics of the health status of elderly people
hosted in different nursing homes. Specifically, we consider a dataset gathered
from the Long Term Care Facilities (LTCF) Programme, a longitudinal study carried
out in Umbria (Italy). The final goal of our analysis is to understand whether the
evolution of the health conditions of the elderly changes significantly across
different nursing homes. To this end, an ordinal Latent Markov model accounting for
both dropout and intermittent missing data patterns is proposed. Then, some
performance measures are computed on a standardized elderly population in order to
rule out the effect of patient case-mix.

Abstract (translated from the Italian) This work analyses the dynamics of the
health status of elderly people hosted in various nursing homes. To this end, we
analyse longitudinal data from the Long Term Care Facilities (LTCF) Programme,
carried out in Umbria (Italy). The final goal is to assess whether the evolution of
patients’ health conditions varies significantly across nursing homes. For this
purpose, we propose an ordinal Latent Markov model that accounts for data missing
because of dropout from the study or for other reasons. In the application, some
case-mix-adjusted outcome indicators are proposed for each nursing home.


Key words: Health care services; Latent Markov models; Longitudinal data. 


1 Introduction 


Health care is one of the most relevant concerns for regional governments in Italy.
In Umbria, a region of central Italy, public programs exist that aim to take care
of specific population segments such as elderly or disabled people. Clearly, it is
important for policy makers to have a deep understanding of the general health
status of these people and of the effectiveness of the health care provided. To
this end, we focus on the Long Term Care Facilities (LTCF) protocol, a program
mainly addressed to elderly people hosted in regional nursing homes (NHs). Through
the administration of specifically designed questionnaires, information is
periodically collected on these people to monitor their physical and psychological
conditions as well as the care services provided by NHs. In this work, using data
from the LTCF dataset, we propose an ordinal Latent Markov model accounting for
both dropout and intermittent missing data patterns, aimed at understanding whether
the evolution of the health conditions of the elderly changes significantly across
different NHs.


2 The LTCF data 


LTCF data are collected through a questionnaire routinely administered
approximately every six months. The questionnaire consists of several sections
dealing with different aspects of health status, such as cognitive conditions, mood
and behavioral disorders, or problems with Activities of Daily Living (ADL).
Therefore, when the entire questionnaire is considered, the underlying trait is
multidimensional. However, as a first step of a more complex data analysis to be
developed, in this work we consider a single section of the questionnaire, namely
the ADL section. The latter includes ten ordinal items that we assume to be the
outcome of an ordinal latent trait. Indeed, they report the level of difficulty
patients experience in performing simple actions such as getting dressed, walking
or stooping, using the bathroom or eating by themselves.

The sample we consider covers the years 2012 and 2013 and includes n = 1292
patients hosted in 41 NHs, whose number of patients ranges from 5 to 96. Ideally,
there should be T = 4 measurement occasions for each patient. However, this is not
always the case, as dropout due to death or discharge might occur. Furthermore,
intermittent missingness is also present (missing occasions between valid
occasions). As is common in longitudinal studies, we need to carefully evaluate the
missingness mechanism in the context we deal with. In this work, we treat
intermittent missing data as well as dropout due to discharge as missing at random
[3]. This choice seems reasonable, as the former have an unknown cause, while the
motivations for discharge are various. On the other hand, dropout due to death is
clearly non-ignorable, as death is associated with a worsening of health status.
This non-ignorable mechanism must be accounted for in the model.


3 The model 


Latent Markov (LM) models are of use when some categorical outcome variables
are measured at a number of time occasions. These outcomes (i.e., questionnaire
items) are assumed to be probabilistically influenced by an unobserved process
(i.e., health status), which is modeled as a first-order discrete-time Markov
process with a finite number of states; see [1]. Three sets of parameters
characterize this structure: conditional response probabilities (probabilities of
specific outcome categories given the latent state), initial probabilities
(probabilities of latent states at the first measurement occasion) and transition
probabilities (probabilities of latent states at the following occasions given the
previous latent state).

Assume we have data on $n$ independent units, indexed by $i = 1, \ldots, n$, and
that unit $i$ is observed at $T_i$ measurement occasions, with $T_i \leq T = 4$.
$Y_i^{(t)}$ denotes the response vector of unit $i$ at occasion $t$
($t = 1, \ldots, T_i$). Such a vector includes $J$ univariate categorical
responses, that is, $Y_i^{(t)} = (Y_{i1}^{(t)}, \ldots, Y_{iJ}^{(t)})'$. In
principle, each indicator $Y_{ij}^{(t)}$ might have a generic number of response
categories indexed from 1 to $c_j$. Similarly, the Markovian latent process is
denoted by $V_i = (V_i^{(1)}, \ldots, V_i^{(T_i)})'$, where, at each time occasion
$t$, $V_i^{(t)}$ is a categorical variable with $k$ levels. Each $Y_{ij}^{(t)}$ is
assumed to be independent of any other variable in the model conditionally on
$V_i^{(t)}$. Therefore, the parameters of interest are the conditional response
probabilities

$$\phi_{jy|v} = P(Y_{ij}^{(t)} = y \mid V_i^{(t)} = v), \quad j = 1, \ldots, J, \; y = 1, \ldots, c_j, \; v = 1, \ldots, k,$$

which are constant with respect to time. As discussed in Section 2, $Y_{ij}^{(t)}$
can be thought of as an ordinal variable. As a consequence, a global logit
parametrization

$$\log \frac{\phi_{j,m+1|v} + \cdots + \phi_{jc_j|v}}{\phi_{j1|v} + \cdots + \phi_{jm|v}} = \tau_{jm} + \delta_v \qquad (1)$$

($j = 1, \ldots, J$; $m = 1, \ldots, c_j - 1$; $v = 1, \ldots, k$) can be imposed.
In Equation (1), it is assumed that $\tau_{j1} > \cdots > \tau_{j,c_j-1}$ for
$j = 1, \ldots, J$ and that $\delta_1 < \cdots < \delta_k$, with $\delta_1 = 0$.
This parametrization fixes the direction of the association between the responses
and the latent variable. In so doing, label switching, a well-known problem in this
class of models, is ruled out. In this case, the association is positive, so that
higher latent states correspond to increasing difficulties in the ADL.

We also allow initial and transition probabilities to depend on individual
covariates denoted by $X_i = (X_{i1}, \ldots, X_{im})'$. These include personal
characteristics such as age and gender as well as binary indicators for NH
membership. The latter make it possible to model the nursing home effect on the
initial and transition probabilities for evaluation purposes. Specifically, we set

$$\pi^{(1)}(v) = P(V_i^{(1)} = v \mid X_i = x), \qquad \pi^{(t)}(v \mid \bar{v}) = P(V_i^{(t)} = v \mid V_i^{(t-1)} = \bar{v}, X_i = x)$$

($\bar{v}, v = 1, \ldots, k$; $t = 2, \ldots, T_i$). Relying on the ordinal nature
of the latent process $V_i$, the models for the conditional initial and transition
probabilities can be respectively expressed by the regression equations

$$\log \frac{\pi^{(1)}(v+1) + \cdots + \pi^{(1)}(k)}{\pi^{(1)}(1) + \cdots + \pi^{(1)}(v)} = \xi_v + x'\beta \qquad (2)$$

($v = 1, \ldots, k-1$), and

$$\log \frac{\pi^{(t)}(v+1 \mid \bar{v}) + \cdots + \pi^{(t)}(k \mid \bar{v})}{\pi^{(t)}(1 \mid \bar{v}) + \cdots + \pi^{(t)}(v \mid \bar{v})} = \psi_{\bar{v}} + \omega_v + x'\gamma \qquad (3)$$

($\bar{v} = 1, \ldots, k$; $v = 1, \ldots, k-1$; $t = 2, \ldots, T_i$). In (2) and
(3) a global logit parametrization like that in (1) is assumed. Under this
parametrization, the covariate effects, represented by the column vectors $\beta$
and $\gamma$, are constant across the logit equations, while the sequences of
thresholds $\xi_1 > \cdots > \xi_{k-1}$ and $\omega_1 > \cdots > \omega_{k-1}$ must
be decreasing to ensure that the cumulative sums of probabilities along the ordered
categories of the latent variables are increasing. Finally, notice that for
identification purposes $\psi_1$ is always set to 0.
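To make the global logit parametrization of Equation (1) concrete, the sketch below converts thresholds and a latent-state effect into category probabilities for a single item. It is an illustration only, not the authors' estimation code; the function names and the numerical values of the thresholds and of the state effects are hypothetical.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def response_probabilities(thresholds, delta_v):
    """Category probabilities for one item j given latent state v.

    thresholds: decreasing values tau_j1 > ... > tau_j,c_j-1
    delta_v:    effect of latent state v
    """
    # Global logit: log[P(Y > m | v) / P(Y <= m | v)] = tau_jm + delta_v
    upper_tail = [1.0] + [expit(tau + delta_v) for tau in thresholds] + [0.0]
    # P(Y = y | v) = P(Y > y - 1 | v) - P(Y > y | v), for y = 1, ..., c_j
    return [upper_tail[y - 1] - upper_tail[y] for y in range(1, len(upper_tail))]

# Hypothetical item with c_j = 3 categories and two latent states
taus = [1.0, -1.0]
phi_state_1 = response_probabilities(taus, delta_v=0.0)
phi_state_2 = response_probabilities(taus, delta_v=1.5)
```

Because the thresholds decrease in m, the differences of the upper-tail probabilities are non-negative, and a larger state effect shifts probability mass towards the higher (more severe) categories.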

We extend the model described so far to account for the missingness mechanism
occurring in our application. First, we add an extra response category $c_j + 1$ in
such a way that, in the case of death after the $t$-th occasion ($t < T$), we
complete the data by setting $Y_{ij}^{(u)} = c_j + 1$ for $u = t+1, \ldots, T$.
Moreover, we define an extra absorbing latent state $k + 1$ corresponding to death.
Consequently, some related additional probabilities have to be properly
constrained. Specifically, for $j = 1, \ldots, J$, $v = 1, \ldots, k$ and
$t = 2, \ldots, T$:

• $\pi^{(1)}(k+1) = 0$: no one can be in the extra latent state at the first occasion;

• $\pi^{(t)}(k+1 \mid k+1) = 1$: no one can revert to other states from death;

• $\phi_{j,c_j+1|v} = 0$: the extra category cannot be observed if one is not dead;

• $\phi_{j,c_j+1|k+1} = 1$: only the extra category can be observed if one is dead.

We also deal with intermittent missing data patterns by extending the set of
covariates explaining the transition probabilities (i.e., Equation (3)).
Specifically, we add a variable measuring the time interval, in days, from the
previous occasion. Although intermittent missingness is assumed to occur at random
(see Section 2), the introduction of such a covariate seems necessary to take into
account the interval between non-missing occasions and to obtain correctly
interpretable results. Indeed, it allows the estimation of six-month-ahead
transition probabilities, six months being the time interval between measurement
waves originally designed in the LTCF study. Notice that the extension we propose
to deal with missing data involves only two additional free parameters to estimate:
the additional threshold $\omega_k$ and the regression parameter in $\gamma$
associated with the aforementioned additional covariate. The other thresholds have
to be set to $-\infty$ to satisfy the constraints above. Parameter estimates are
obtained by means of the Expectation-Maximization algorithm [2], which is a
standard estimation tool for LM models.
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4 Results 


An important part of the model selection process in latent variable models is the
choice of the number of latent states k. A commonly adopted strategy considers both
formal criteria like the Bayesian Information Criterion (BIC) [4] and interpretability
of results. For this application, we have fitted models with k ranging from 2 to
10. After some comparisons, the model with k = 5 has been selected. This choice
represents a compromise between the BIC, the interpretability of the resulting
latent states, and the need to avoid latent states with a very small number of patients.

A summary of the estimated conditional response probabilities $\hat{\phi}_{jy|v}$ can be
provided by the standardized item score

$$\hat{s}_{jv} = \frac{1}{c_j - 1} \sum_{y=1}^{c_j} (y - 1)\, \hat{\phi}_{jy|v}.$$

Specifically, $\hat{s}_{jv}$ indicates on a 0-1 scale the difficulty of a patient in latent state v in
taking the action described by item j. Table 1 reports the standardized item scores
for the five-state model as well as their averages across latent states, denoted by
$\bar{s}_j$. According to Table 1, the first difficulties are experienced in taking a bath or a
shower, dressing the lower part of the body, and maintaining personal hygiene
(respectively, items 1, 4 and 2), while the last difficulties are related to bed mobility
and eating (items 9 and 10).


                item j
 v              1      2      3      4      5      6      7      8      9      10
 1            0.236  0.124  0.064  0.101  0.007  0.005  0.012  0.024  0.002  0.000
 2            0.477  0.421  0.340  0.420  0.129  0.107  0.187  0.272  0.052  0.007
 3            0.667  0.622  0.571  0.648  0.444  0.410  0.508  0.585  0.287  0.063
 4            0.876  0.835  0.810  0.871  0.788  0.767  0.806  0.851  0.621  0.304
 5            0.992  0.988  0.985  0.992  0.986  0.984  0.986  0.991  0.949  0.848
 $\bar{s}_j$  0.650  0.598  0.554  0.606  0.471  0.455  0.500  0.545  0.382  0.244

Table 1 Standardized item scores and their averages


Table 2 reports the estimated initial and transition probabilities for the latent
process. These probabilities are averaged across the distribution of individual
covariates. Note that patients are rather uniformly distributed across latent states at
the first occasion.

Looking at the transition probabilities, the probability of persistence (i.e., proba- 
bility of remaining in the same latent state) decreases for higher states as the proba- 
bility of migrating towards worse states or the extra state d, death, increases. 
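The reading of the transition probabilities given above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it copies the averaged values reported in Table 2, appends the absorbing row for the death state d implied by the constraint that no one can leave it, and extracts persistence and worsening probabilities. The two-step composition additionally assumes that the averaged matrix can be treated as time-homogeneous, which is our simplification and not a property of the fitted model. All function names are hypothetical.

```python
# Averaged transition probabilities copied from Table 2
# (rows: current state 1-5; columns: next state 1-5, then death d).
P = [
    [0.877, 0.121, 0.002, 0.000, 0.000, 0.000],
    [0.049, 0.686, 0.251, 0.013, 0.001, 0.000],
    [0.001, 0.081, 0.613, 0.274, 0.029, 0.002],
    [0.000, 0.004, 0.097, 0.504, 0.347, 0.048],
    [0.000, 0.000, 0.010, 0.128, 0.539, 0.322],
]
# Absorbing state d: once entered, it is never left.
P.append([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])


def persistence(matrix):
    """Probability of remaining in the same state, one value per state."""
    return [matrix[v][v] for v in range(len(matrix))]


def worsening(matrix):
    """Probability of moving to any worse state (death included), per living state."""
    return [sum(row[v + 1:]) for v, row in enumerate(matrix[:-1])]


def compose(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


# Two consecutive transitions (assumption: the same averaged matrix applies twice).
P2 = compose(P, P)
death_within_two = [row[-1] for row in P2[:-1]]

print("persistence:", persistence(P)[:-1])
print("worsening:  ", [round(w, 3) for w in worsening(P)])
print("death within two transitions:", [round(d, 3) for d in death_within_two])
```

Under these assumptions a patient starting in state 5 has a probability of about 0.50 of being in the death state after two transitions. The row for state 5 in Table 2 sums to 0.999 because of rounding, so the composed rows sum to one only approximately.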

          initial        transition probabilities (to state)
 state    probability    1      2      3      4      5      d
 1        0.118          0.877  0.121  0.002  0.000  0.000  0.000
 2        0.128          0.049  0.686  0.251  0.013  0.001  0.000
 3        0.171          0.001  0.081  0.613  0.274  0.029  0.002
 4        0.241          0.000  0.004  0.097  0.504  0.347  0.048
 5        0.342          0.000  0.000  0.010  0.128  0.539  0.322

Table 2 Estimated average initial and transition probabilities

As regards the effect of NH membership on the latent process, here we focus on
the NHs with the highest and lowest effects on the probabilities of transition
towards worse states. The estimated difference between these effects in the linear
predictor is equal to 2.337, with a p-value lower than $10^{-5}$. In Table 3 we report
the corresponding estimated average six-month ahead transition probabilities
computed on the same set of elderly patients. These transition matrices are comparable
because they have been standardized over the same distribution of the other covariates
(age and gender) to remove the patient case-mix effect. From Table 3 we conclude that
there is a large difference between NHs in their ability to maintain the elderly in good
health. Further investigations are needed to explain the reasons for these differences.


         lower effect NH                            | higher effect NH
 state   1      2      3      4      5      d       | 1      2      3      4      5      d
 1       0.959  0.040  0.001  0.000  0.000  0.000   | 0.566  0.423  0.011  0.000  0.000  0.000
 2       0.118  0.778  0.100  0.004  0.000  0.000   | 0.007  0.315  0.612  0.061  0.004  0.000
 3       0.003  0.186  0.685  0.117  0.008  0.001   | 0.000  0.012  0.265  0.578  0.133  0.011
 4       0.000  0.009  0.218  0.592  0.167  0.014   | 0.000  0.000  0.015  0.185  0.590  0.209
 5       0.000  0.001  0.025  0.270  0.568  0.136   | 0.000  0.000  0.001  0.021  0.238  0.739

Table 3 Estimated six-month ahead transition probabilities for higher and lower nursing home
effects
New fuzzy composite indicators for dyslexia


Fuzzy composite indicators for the early diagnosis of
dyslexia


Isabella Morlini and Maristella Scorza 


Abstract Composite indicators should ideally identify multidimensional concepts 
that cannot be captured by a single variable. In this paper, we suggest a method 
based on fuzzy set theory for the construction of fuzzy synthetic indexes of dyslexia, 
using the set of manifest variables measured by means of reading tests. A few criteria 
for assigning values to the membership function are discussed, as well as criteria for 
defining the weights of the variables. An application regarding the diagnosis of 
dyslexia in primary and middle school in Italy is presented. In this application, the 
fuzzy approach is compared with the crisp approach currently used in Italy for
detecting dyslexic children in compulsory school. 

Abstract The early diagnosis of dyslexia in school-age children is of
fundamental importance in order to ensure targeted teaching and to prevent
possible negative effects on personality development. Currently in Italy
dyslexia is diagnosed by analysing the speed or the accuracy of
reading in normative tests. Since dyslexia is a complex phenomenon that can
be measured only through the joint analysis of the different aspects of
reading performance, and does not follow the rigid division between “pathology present”
and “pathology absent” but can manifest itself with different levels of severity, in this
work new fuzzy composite indices are proposed to measure the degree of the
pathology in children attending primary and lower secondary school.


Key words: composite indicator, dyslexia, fuzzy index, membership function,
weighting criteria

1 Introduction


Decoding ability in primary school in Italy and in countries with transparent orthography 
is currently assessed with the aid of standardized tests requiring the students to read aloud 
a selected list of words and non-words or a text. The most widely used standardized tests 
in Italy have been introduced by Sartori et al. (2007). Recently, a new screening procedure 
for identifying impaired decoders in elementary grades has been proposed by Morlini et 
al. (2014). What is important in the use of tests and screening procedures is the way the 
results are interpreted. One of the defining characteristics of a skilled decoder is that he or
she not only is able to spell written words (or non-words) accurately, but also does so
rapidly and automatically. An individual who spells accurately but very slowly cannot be
considered a skilled decoder. A slow rate of word reading is therefore characteristic of
impaired decoding, as is low accuracy, especially in transparent languages (World Health
Organization (2008)). In Italy, decoding ability is assessed without taking both aspects
into account jointly, so an individual can be classified as not impaired because he or she is
able to read words (or non-words) very rapidly, even though he or she misspells a fairly
large number of words (or non-words). Likewise, individuals with weak decoding skills who
are able to read a large number of words, provided they are given ample time, can be
erroneously classified as adequate decoders. Many authors have outlined the necessity of considering
both speed and accuracy for a valid assessment of decoding skills and a new challenge in 
learning disability research is to develop composite indicators that incorporate measures 
of speed as well as of accuracy (Morlini et al. (2015)). Since dyslexia is a vague concept 
and the rigid partition between impaired and not impaired readers does not always reflect 
reality, fuzzy theory should be used in defining these new indicators. 

In Section 2, we deal with the general problem of obtaining a synthetic fuzzy measure 
of a latent phenomenon like dyslexia from a set of metric variables. We present two 
criteria to transform the values of a variable into fuzzy numbers. In Section 3, we discuss 
the problem of weighting the variables and aggregating them into composite indicators. 
Clearly, the weights should reflect the contribution of each variable to the latent 
phenomenon. In Section 4, we focus on the specific application of measuring dyslexia in 
compulsory schools. The gradual transition from skilled to impaired readers can be 
captured by fuzzy indexes, as well as the risk for dyslexia. We apply the method to a 
sample of 3932 students attending elementary and middle schools in Italy. The fuzzy 
indicators of dyslexia allow us to obtain membership functions that can be compared with 
the result of the currently used diagnostic procedure, which strictly identifies a student as 
being “dyslexic” or “not dyslexic”.


2 The fuzzy approach 


Let X be a set of elements $x \in X$. A fuzzy subset A of X is a set of ordered pairs:

$$[x, \mu_A(x)] \quad \forall x \in X$$

where $\mu_A(x)$ is the membership function (m.f.) of x to A in the closed interval [0,1]. If
$\mu_A(x) = 0$, then x does not belong to A, while if $\mu_A(x) = 1$, then x completely belongs
to A. If $0 < \mu_A(x) < 1$, then x partially belongs to A and its membership to A increases
according to the values of $\mu_A(x)$. Let us assume that the subset A defines the position
of each element with reference to achievement of the latent concept, e.g. dyslexia. In
this case, $\mu_A(x) = 1$ identifies a situation of full achievement of the disease, whereas
$\mu_A(x) = 0$ denotes a person not affected by the disease (a very skilled decoder). Consider
a set of n individuals and p metric variables $X_s$ (s = 1, 2, ..., p) reflecting the latent
phenomenon. In the case of dyslexia, these variables are measures of reading
performance like the time of reading in seconds, the number of misspelled words or
the number of syllables read in a second. Without loss of generality, let us assume
that each variable is positively related with that phenomenon, i.e. it satisfies the
property “the larger the more impaired”. If a variable $X_s$ shows a negative correlation
(like the number of syllables read in a second) we substitute it with the simple
decreasing transformation $f(x_{si}) = \max_i(x_{si}) - x_{si}$.

In order to define the m.f. for each variable it is necessary to identify the extreme
situations such that $\mu_A(x) = 0$ (non-membership) and $\mu_A(x) = 1$ (full membership) and
to define a criterion for assigning the m.f. to the intermediate values. Many criteria
have been proposed in the literature, especially in the field of social sciences, for
measuring latent concepts like well-being, satisfaction and poverty (see e.g. Zani et
al. (2012) and (2013)). For the specific purpose of measuring dyslexia we will
consider only two specifications.

Let us assume that $X_s$ is a metric variable. For simplicity of notation, in the
following we will omit the index s. We choose an inferior (lower) threshold l and a
superior (upper) threshold u, with l and u finite, and we define the m.f. as follows:

$$\mu_A(x_i) = \begin{cases} 0 & x_i \le l \\ \dfrac{x_i - l}{u - l} & l < x_i < u \\ 1 & x_i \ge u \end{cases} \qquad (1)$$


In (1) the m.f. is a linear function between the values of the two thresholds.
Alternatively, we define the m.f. as $\mu_A(x_i) = 1/(1 + d(x_i))$, where $d(x_i)$ is the distance
between the value $x_i$ and dyslexia, measuring the degree of impairment and
indicating the level of the achievement of dyslexia. If $d(x_i) = 0$, there is full
membership to A and $\mu_A(x_i) = 1$. If $d(x_i) > 0$, then $\mu_A(x_i) < 1$. In general, the
relationship between physical measures and perception takes an exponential form
(Zimmerman (1993), Baliamoune-Lutz (2004)). If we assume that the relationship
between physical measures and decoding impairment takes the same form, then the
distance $d(x_i)$ can be expressed as $d(x_i) = e^{-a(x_i - b)}$ and the m.f. is then defined as:

$$\mu_A(x_i) = \frac{1}{1 + e^{-a(x_i - b)}} \qquad (2)$$

The parameter a represents the extent of uncertainty and b may be viewed as the
point at which the performance of the subject changes from “bad” to “pathological”.
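The two membership functions can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the thresholds in the usage lines are invented for the example, and the logistic form used for m.f. (2), with exponent $-a(x - b)$, is an assumption consistent with the roles of a and b described above.

```python
import math


def linear_mf(x, lower, upper):
    """M.f. (1): 0 at or below `lower`, 1 at or above `upper`, linear in between."""
    if upper <= lower:
        raise ValueError("the upper threshold must exceed the lower threshold")
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / (upper - lower)


def logistic_mf(x, a, b):
    """M.f. (2), assumed form 1 / (1 + exp(-a * (x - b))).

    The two branches are algebraically identical; splitting them avoids
    overflow in exp() for values of x far below b.
    """
    z = a * (x - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical reading time of 90 seconds with invented thresholds.
print(linear_mf(90.0, lower=60.0, upper=120.0))   # halfway between the thresholds
print(logistic_mf(100.0, a=0.5, b=100.0))         # at x = b the membership is 0.5
```

With the linear form every value outside the two thresholds is mapped to exactly 0 or 1, whereas the logistic form only approaches these limits, so the two specifications differ most in the tails of the distribution.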


3 The fuzzy composite indicators 


The most general aggregation function of variables for obtaining a composite
indicator is the weighted generalized mean

$$\mu_A(i) = \left[ \sum_{s=1}^{p} [\mu_A(x_{si})]^{\alpha}\, w_s \right]^{1/\alpha},$$

where $w_s > 0$ is the normalized weight that expresses the relative importance of the
variable $X_s$, with $\sum_{s=1}^{p} w_s = 1$. For the sake of simplicity, we consider $\alpha = 1$,
that is the weighted arithmetic mean. Furthermore, we consider different weights
reflecting the importance of each variable in diagnosing dyslexia. Since previous studies
(Morlini et al. (2014, 2015)) have shown that the first component of a principal component
analysis (PCA) accounts for a high percentage of the total variance of the measures
of speed and accuracy in psychometric reading tests, we consider weights
proportional to the correlation of each variable with the first component of a PCA.

Alternatively, in order to attach to each variable a weight sensitive to the fuzzy
membership, we consider the fuzzy proportion of each variable to the achievement
of dyslexia, $g(X_s) = \frac{1}{n} \sum_{i=1}^{n} \mu_A(x_{si})$, and define the normalized weights as follows:

$$w_s = \ln\!\left(\frac{1}{g(X_s)}\right) \Bigg/ \sum_{s=1}^{p} \ln\!\left(\frac{1}{g(X_s)}\right). \qquad (3)$$
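As an illustration of how the weights (3) and the composite indicator fit together, the sketch below computes the fuzzy proportion of each variable, the logarithmic weights, and the weighted generalized mean for one individual. It is a sketch under stated assumptions: the function names and the toy membership values are hypothetical, and the exponent of the generalized mean is called `alpha` here.

```python
import math


def fuzzy_proportion(memberships):
    """g(X_s): mean membership value of one variable over the n individuals."""
    return sum(memberships) / len(memberships)


def log_weights(membership_columns):
    """Normalized weights proportional to ln(1 / g(X_s)), one per variable."""
    raw = []
    for column in membership_columns:
        g = fuzzy_proportion(column)
        if not 0.0 < g <= 1.0:
            raise ValueError("g(X_s) must lie in (0, 1] for the logarithm to exist")
        raw.append(math.log(1.0 / g))
    total = sum(raw)
    if total == 0.0:
        raise ValueError("every variable has g = 1, so the weights are undefined")
    return [r / total for r in raw]


def composite(memberships_i, weights, alpha=1.0):
    """Weighted generalized mean of one individual's membership values."""
    return sum(w * m ** alpha for w, m in zip(weights, memberships_i)) ** (1.0 / alpha)


# Toy membership values (hypothetical): 2 variables observed on 4 individuals.
columns = [
    [0.0, 0.5, 1.0, 0.5],   # g = 0.50
    [0.0, 0.0, 1.0, 0.0],   # g = 0.25
]
w = log_weights(columns)
print(w)                          # the rarer symptom receives the larger weight
print(composite([0.5, 0.0], w))   # composite score of the second individual
```

The logarithmic weighting gives more importance to variables on which few individuals show impairment, since a high membership on such a variable is more informative than one on a variable where high values are common.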


4 Fuzzy indicators for dyslexia: an application 


We administer the standardized tests Batteries for the Diagnosis of Reading and 
Spelling Disabilities (Sartori et al. (2007)) to 3932 students attending elementary 
(from grade II) and middle school in Lombardia and Emilia Romagna regions 
(Northern Italy). Table 1 reports the frequency distribution of students in each grade. 
In these tests the metric variables measuring performances of decoding are: 

X1: time (in seconds) in reading the list of words

X2: number of words mispronounced in reading the list of words 

X3: time (in seconds) in reading the list of non-words 

X4: number of incorrect pronunciations in reading the list of non-words 


Table 1: Frequency distribution of students in each grade

                      Elementary school            Middle school
Grade                 II     III    IV     V     | VI     VII    VIII
Number of students    715    472    621    519   | 922    311    372


Figure 1: Distribution of each variable in each grade 


Figure 1 shows the distribution of each variable in each grade. We perform a PCA
on the correlation matrix. The first component accounts for 66.5% of the total variance,
is highly correlated with all variables and is the only one with an eigenvalue greater than
one. We construct the following fuzzy indicators:

$F_{11}$: using m.f. (1) with $l = x_{5\%}$ (the 5th percentile) and $u = x_{95\%}$ (the 95th percentile) in
each grade and weights proportional to the factor loadings of the first principal component.

$F_{12}$: using m.f. (1) with $l = x_{5\%}$ and $u = x_{95\%}$ in each grade and weights (3).

$F_{21}$: using m.f. (2) with a = 0.5 and $b = x_{90\%}$ (the 90th percentile) in each grade and
weights proportional to the factor loadings of the first principal component.

$F_{22}$: using m.f. (2) with a = 0.5 and $b = x_{90\%}$ in each grade and weights (3).

Table 2 reports the frequency distribution of the values of the fuzzy indices. We
may note that the differences between $F_{11}$ and $F_{12}$ and between $F_{21}$ and $F_{22}$ are
negligible and thus the weighting system does not substantially change the fuzzy indicator.
On the other hand, the choice of the membership function influences the results. Applying
the diagnostic criterion currently used in Italy, for which a student is classified as impaired
if he or she shows a value above the normative cut-off in two or more variables, 4.8% of the
students are classified as dyslexic. The fuzzy indicators give more insight into this
percentage. According to m.f. (1), about 2% are definitely dyslexic, while 2.9% should be
considered at high risk for impairment. Another approximately 4% may be
viewed as being at medium risk. According to m.f. (2), about 1% of the students are
definitely dyslexic, while 1% are at high risk and approximately 1.6% at medium risk.
We may also identify the prevalence of very skilled readers (64% according to m.f. (1)
and 89% according to m.f. (2)) and the percentages of normal readers (given by the
frequencies of values ranging from 0.4 to 0.7).

In conclusion, this paper presents a methodology to build fuzzy composite
indicators with the aim of considering both speed and accuracy of reading in the early
diagnosis of dyslexia and of going beyond the rigid, unrealistic partition
between “dyslexic” and “not dyslexic” students. The limit between a “bad” and a
“pathological” performance in psychometric reading tests is somewhat fuzzy. The
application shows that the proposed indices work well in identifying the level of
impairment of the students; the results agree with the percentage of dyslexic students
identified with the traditional diagnostic criterion but give more insight.


Table 2: Frequency distributions of the values of the fuzzy indices

             F11     F12     F21     F22
0.0 - 0.4   0.648 | 0.643 | 0.895 | 0.894
0.4 - 0.6   0.208 | 0.206 | 0.041 | 0.047
0.6 - 0.7   0.059 | 0.062 | 0.028 | 0.024
0.7 - 0.8   0.038 | 0.039 | 0.018 | 0.016
0.8 - 0.9   0.029 | 0.029 | 0.009 | 0.009
0.9 - 1.0   0.019 | 0.021 | 0.009 | 0.010
Tot         1.000 | 1.000 | 1.000 | 1.000
Big Textual Data: Lessons and Challenges for
Statistics 


Grandi dati testuali: le lezioni e le sfide per le statistiche 


Fionn Murtagh 


Abstract At issue are a few early stage case studies relating to: research publishing 
and research impact; literature, narrative and foundational emotional tracking; and 
social media, here Twitter, with a social science orientation. Central relevance and 
importance will be associated with the following aspects of analytical methodol- 
ogy: context, leading to availing of semantics; focus, motivating homology between 
fields of analytical orientation; resolution scale, which can incorporate a concept 
hierarchy and aggregation in general; and acknowledging all that is implied by this 
expression: correlation is not causation. Application areas are: research publishing 
and qualitative assessment, narrative analysis and assessing impact, and baselining 
and contextualizing, statistically and in related aspects such as visualization. 


Key words: mapping narrative, emotion tracking, significance of style, Correspon- 
dence Analysis, chronological hierarchical clustering 


1 Underlying Themes in Methodology, Introduction 


Clearly, through integration of analytical methodology and domain of application,
the choice of methodology, or even its development, depends on the specific
requirements. Nevertheless, the following general aspects of contemporary analytics,
including textual data analytics, are worth noting.

An interview with Peter Norvig, Google, in C. Anderson [1] contained the
following controversial perspectives: “Petabytes allow us to say: ‘Correlation is
enough’. We can stop looking for models. We can analyze the data without hypotheses
about what it might show. We can throw the numbers into the biggest computing
clusters the world has ever seen and let statistical algorithms find patterns where
science cannot.” “Correlation supersedes causation, and science can advance even
without coherent models, unified theories, or really any mechanistic explanation at
all.”

To counteract this automation of all analytical reasoning, and to accept the need
for inductive reasoning, the following are important: (1) context for our analytics,
through the integration of data and domain, which leads to (2) semantic analytics, and
analytical synthesis in, and from, data and information; and (3) qualitative as well as
quantitative evaluation and related analytics. All in all, this leads to the
Correspondence Analysis platform as an inductive reasoning framework, also for other
analytical methodologies.

Interestingly, the focus on regions of interest in information space is stressed by 
[21]. An article about the Internet of Things and Big Data by John Thornhill in 
the newspaper, Financial Times, on 9 January 2017 had this comment: “Sir Nigel 
Shadbolt, co-founder of the Open Data Institute ... The next impending revolution, 
he argues, will be about giving consumers control over their data.” 

Ethical consequences of Big Data mining and analysis may be associated with 
the following, from [10]: “Rehabilitation of individuals. The context model is al- 
ways formulated at the individual level, being opposed therefore to modelling at an 
aggregate level for which the individuals are only an ‘error term’ of the model.” 

In [6], “There is the potential for big data to evaluate or calibrate survey findings 
... to help to validate cohort studies”. Examples are discussed of “how data ... tracks 
well with the official”, far larger, repository or holdings. It is well pointed out how
one case study discussed “shows the value of using ‘big data’ to conduct research on
surveys (as distinct from survey research)”. Limitations though are clear: “Although
randomization in some form is very beneficial, it is by no means a panacea. Trial
participants are commonly very different from the external pool, in part because of
self-selection, ...”. On this point: “One type of selection bias is self-selection (which
is our focus)”. Important points towards addressing these contemporary issues
include the following. “When informing policy, inference to identified reference
populations is key”: this is part of the bridge which is needed between data analytics
technology and deployment of outcomes.

Furthermore there is this: “In all situations, modelling is needed to accommodate
non-response, dropouts and other forms of missing data.” While “Representativity
should be avoided”, here is a fundamental way to address
what we need to address: “Assessment of external validity, i.e. generalization to the
population from which the study subjects originated or to other populations, will 
in principle proceed via formulation of abstract laws of nature similar to physical 
laws”. 

Hence our motivation for the following framework for analytical processes: Eu- 
clidean geometry for semantics of information; hierarchical topology for other as- 
pects of semantics, and in particular how a hierarchy expresses anomaly or change. 
A further useful case is when the hierarchy respects chronological or other sequence 
information. 
2 Towards: Qualitative as well as Quantitative Research
Effectiveness and Impact 


For analysis of research funding, of publishing, and of commercial outcomes, ac- 
count needs to be taken of measures of esteem. Also account is taken of research 
impact, through impact of research products: (1) research results, (2) organisation of 
science (journal editing, running conferences), (3) knowledge transfer, supervision, 
(4) technology innovations. 

Correspondence Analysis when based on part of an ontology or concept hier- 
archy can be considered as “information focusing”. Correspondence Analysis pro- 
vides simultaneous representation of observations and attributes. We project other 
observations or attributes into the factor space: these are supplementary or contex- 
tual observations or attributes. A 2-dimensional or planar view is an approximation 
of the full cloud of observations or of attributes. Therefore there can be benefit in the 
following: define a small number of aggregates of either observations or attributes, 
and carry out the analysis on them. Then project the full set of observations and 
attributes into the factor space. 

In support of “The Leiden Manifesto for research metrics”, DORA (San Francisco
Declaration on Research Assessment), and the Metrics Tide Report (HEFCE, Higher
Education Funding Council England, 2015), qualitative judgement is primary.
Research results may be assessed through first determining a taxonomic rank by
mapping to a taxonomy of the domain (a manual action). Then there will be
unsupervised aggregation of criteria for stratification.

Research impact should be evaluated, first of all, based on qualitative considera- 
tions. Evaluation of research, especially at the level of teams or individuals can be 
organized by, firstly, developing and maintaining a taxonomy of the relevant sub- 
domains and, secondly, a system for mapping research results to those subdomains 
that have been created or significantly transformed because of these research results. 
Of course, developing and/or incorporating systems for other elements of research 
impact, viz., knowledge transfer, industrial applications, social interactions, etc., are 
to be taken into account also. 

See [19] for such work. Generally also see [5]. The latter maps out evolving 
vocabulary and associates this also with influential published articles. 


3 Qualitative Style in Narrative for Analysis and Synthesis of 
Narrative 


For [11], the composition of the movie, Casablanca, is “virtually perfect”. Text is 
the “sensory surface” of the underlying semantics. 

Here there is consideration as to how permutation testing and evaluation can be
very relevant for qualitative appraisal. For the Casablanca movie, shot by
Warner Brothers between May and August 1942, and also for some early episodes of
the CSI Las Vegas (Crime Scene Investigation) television drama series, from the
year 2000, the attributes used, following [15], are those listed below.

All is based on the following: Euclidean geometry for semantics of information; 
hierarchical topology for other aspects of semantics, and in particular how a hier- 
archy expresses anomaly or change. The hierarchy respects chronological or other 
sequence information. Chronological hierarchical clustering, also termed contigu- 
ity constrained hierarchical clustering, is based on the complete link agglomerative 
clustering criterion [12, 2, 7]. 


1. Attributes 1 and 2: The relative movement, given by the mean squared distance 
from one scene to the next. We take the mean and the variance of these relative 
movements. Attributes 1 and 2 are based on the (full-dimensionality) factor space 
embedding of the scenes. 

2. Attributes 3 and 4: The changes in direction, given by the squared difference in 
correlation from one scene to the next. We take the mean and variance of these 
changes in direction. Attributes 3 and 4 are based on the (full-dimensionality) 
correlations with factors. 

3. Attribute 5 is mean absolute tempo. Tempo is given by difference in scene length 
from one scene to the next. Attribute 6 is the mean of the ups and downs of 
tempo. 

4. Attributes 7 and 8 are, respectively, the mean and variance of rhythm given by 
the sums of squared deviations from one scene length to the next. 

5. Finally, attribute 9 is the mean of the rhythm taking up or down into account. 


For permutation testing, assessment was carried out relative to uniformly ran- 
domized sequences of scenes or sub-scenes. 
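The permutation assessment described above can be sketched as follows, using only attribute 1 (the mean squared distance between consecutive scenes). This is a minimal illustration and not the procedure of [15]: the function names are hypothetical, the scene coordinates in the usage lines are synthetic, and the choice of a lower-tail comparison (how often a random ordering is at least as smooth as the observed one) is an assumption made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of the factor space."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_relative_movement(scenes):
    """Attribute 1: mean squared distance from one scene to the next."""
    steps = [squared_distance(scenes[i], scenes[i + 1])
             for i in range(len(scenes) - 1)]
    return sum(steps) / len(steps)


def permutation_p_value(scenes, statistic, n_permutations=999, seed=0):
    """Share of uniformly random scene orderings whose statistic does not
    exceed the observed one (lower-tail assessment, an assumption here)."""
    rng = random.Random(seed)
    observed = statistic(scenes)
    shuffled = list(scenes)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(shuffled) <= observed:
            count += 1
    # The +1 terms count the observed ordering itself among the permutations.
    return observed, (count + 1) / (n_permutations + 1)


# Synthetic example: ten "scenes" lying along a straight line in a 2-D space.
path = [(float(i), 0.0) for i in range(10)]
observed, p = permutation_p_value(path, mean_relative_movement)
print(observed, p)
```

A small value indicates that the observed sequence moves through the semantic space far more smoothly than randomly ordered scenes would. The same function can be reused with any of the other attributes by passing a different `statistic`.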


4 Statistical Significance of Impact 


Underlying [18] is the testing of social media with the aim of designing interven- 
tions, associated with statistical assessment of impact. The application here is to 
environmental communication initiatives. Measuring impact of public engagement 
theory, in the sense of the eminent political scientist, Jürgen Habermas, involves
public engagement centred on communicative theory; by implication therefore, dis- 
course as a possible route to social learning and environmental citizenship. 

The case study here was directed towards:

1. Qualitative data analysis of Twitter.
2. Nearly 1000 tweets in October, November 2012.
3. Evaluation of tweet interventions.
4. Eight separate Twitter campaigns carried out.


Mediated by the latent semantic mapping of the discourse, semantic distance
measures were developed between deliberative actions and the aggregate social
effect. We let the data speak in regard to influence, impact and reach.

Impact was algorithmically specified in this way: semantic distance between the
initiating action, and the net aggregate outcome. This can be statistically tested 
through the modelling of semantic distances. It can be further visualized and evalu- 
ated. 

A fundamental aspect of the Twitter analysis was how a tweet, considered as 
a “campaign initiating tweet”, differed from an aggregate set of tweets. The latter 
was the mean tweet, where the tweets were first mapped into a semantic space. The 
semantic space is provided by the factor space, which is endowed with a Euclidean 
metric. For very high dimensions, we find “data piling” or concentration. That is, 
the cloud of points becomes concentrated in a point. Now that could be of benefit to 
us, when we are seeking a mean (hence, aggregate) point in a very high dimensional 
space. A further aspect is when it is shown that the cloud piling or concentration is 
very much related to the marginal distributions. 

Here we show how we can test the statistical significance of effectiveness. 

The campaign 7 case, with the distance between the tweet initiating campaign 
7, and the mean campaign 7 outcome, in the full, 338-dimensional factor space is 
equal to 3.670904. 

Compare that to all pairwise distances of non-initiating tweets. We verified that
these distances are normally distributed, with a small number of large distances. By
the central limit theorem, for very large numbers of such distances, they will be
normally distributed. Denote the mean by $\mu$, and the standard deviation by $\sigma$. Mean
and standard deviation are defined from distances between all non-initiating tweets,
in the full dimensionality semantic (or factor) space. We find $\mu = 12.64907$,
$\mu - \sigma = 8.508712$, and $\mu - 2\sigma = 4.368352$.

We find the distance between initiating tweet and mean outcome, for campaign 7,
in terms of the mean and standard deviation of tweet distances to be: $\mu - 2.168451\sigma$.
Therefore for z = −2.16, the campaign 7 effectiveness is significant at the 1.5% level
(i.e. z = −2.16, in the one-sided case, has 98.5% of the normal distribution greater
than it in value).
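The arithmetic behind this significance statement can be reproduced from the reported figures alone. The sketch below is illustrative: it recovers the standard deviation from the reported mean and the reported value of the mean minus one standard deviation, standardizes the campaign 7 distance, and evaluates the lower tail of the standard normal distribution. Variable and function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu = 12.64907            # reported mean of pairwise tweet distances
sigma = mu - 8.508712    # recovered from the reported value of mu - sigma
d7 = 3.670904            # reported distance for campaign 7

z = (d7 - mu) / sigma            # standardized distance
lower_tail = normal_cdf(z)       # share of distances expected to be smaller

print(round(z, 4), round(lower_tail, 4))
```

The standardized distance comes out at about −2.168 and the lower-tail probability at about 0.015, which agrees with the 1.5% level quoted in the text when it is read as a one-sided probability.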

In the case of campaigns 1, 4, 5, 6, their distances between initiating tweet and 
mean outcome are less than 90% of all tweet distances. Therefore the effectiveness 
of these campaigns is in the top 10% which is not greatly effective (compared to 
campaign 7). 

In the case of campaigns 3 and 8, we find their distances to be less than 80% of 
all tweet distances. So their effectiveness is in the top 20%. 

Finally, campaign 2 is the least good fit, relative to initiating tweet and outcome. 


5 Tracking Emotion 


This relates to determining and tracking emotion in an unsupervised way. This is
as opposed to machine learning, as in sentiment analysis, which is supervised.
Emotion is understood as a manifestation of the unconscious. Social activity causes
emotion to be expressed or manifested. This can lead to later discussion of the
psychoanalyst Matte Blanco. See [14].

The foundation of this tracking of emotion, and of determining the depth of emotion, 
is the methodology of metric space mapping and hierarchical topology. The former 
maps the textual data into a factor space endowed with the Euclidean metric, and the 
latter may be chronologically constrained hierarchical clustering. 

The examples to follow are based on two texts. The first is the dialogue (and 
dialogue only) between the main characters Ilsa and Rick in the movie Casablanca, 
selected from the scenes in which both protagonists appear (scenes 22, 26, 28, 30, 
31, 43, 58, 59, 70, 75 and the last scene, 77). The second is chapters 9, 10, 11 and 
12 of Gustave Flaubert’s 19th century novel, Madame Bovary, which concern the 
three-way relationship between Emma Bovary, her husband Charles, and her lover 
Rodolphe Boulanger. 

Following [16], Figure 1 shows, in the full dimensionality factor space based on all 
interrelationships of scenes and words, the distance between the word “darling” and 
each of the 11 scenes. The same was done for the word “love”. The semantic locations 
of these two words, relative to the semantic locations of scenes 30 and 70, are 
highlighted with boxes. 

Then in Figure 2, sequence-constrained hierarchical clustering is carried out on 
the scenes used, i.e. scenes 22, 26, 28, 30, 31, 43, 58, 59, 70, 75, 77 (using the 
dialogue between Ilsa and Rick). Compare this with the big changes in scenes 30 and 
70 that are indicated in the previous figure. 
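
A minimal sketch of sequence-constrained agglomeration may help to fix ideas. It only allows merges between clusters that are adjacent in the sequence. The complete-link criterion and the toy coordinates are assumptions made for illustration; they are not the Casablanca data, and the function name is hypothetical.

```python
import math


def sequence_constrained_clustering(points):
    """Agglomerate a sequence, merging only adjacent clusters.

    Uses the complete-link criterion: the dissimilarity of two adjacent
    clusters is the largest pairwise Euclidean distance between members.
    Returns the merges as (left_members, right_members, level).
    """
    clusters = [[i] for i in range(len(points))]   # kept in sequence order
    merges = []
    while len(clusters) > 1:
        best_k, best_level = 0, math.inf
        for k in range(len(clusters) - 1):
            level = max(math.dist(points[i], points[j])
                        for i in clusters[k] for j in clusters[k + 1])
            if level < best_level:
                best_k, best_level = k, level
        left, right = clusters[best_k], clusters[best_k + 1]
        merges.append((left, right, best_level))
        clusters[best_k:best_k + 2] = [left + right]
    return merges


# Hypothetical 2-D factor coordinates for six consecutive scenes.
scenes = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (2.0, 2.0), (2.1, 2.0), (2.2, 2.1)]
merges = sequence_constrained_clustering(scenes)
for left, right, level in merges:
    print(left, right, round(level, 3))
```

In this toy sequence the last merge joins the first three scenes with the last three, which is where the largest change along the sequence occurs.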

We now consider the novel Madame Bovary, with the three-way interrelationships of 
Emma Bovary, her husband Charles, and her lover, Rodolphe. 

Figure 3 presents an interesting perspective that can be considered relative to the 
original text. Rodolphe is emotionally scoring over Charles in text segment 1, then 
again in 3, 4, 5, 6. In text segment 7, Emma is accosted by Captain Binet, giving 
her qualms of conscience. Charles regains emotional ground with Emma through 
Emma’s father’s letter in text segment 10, and Emma’s attachment to her daughter, 
Berthe. Initially the surgery on Hippolyte in text segment 11 draws Emma close to 
Charles. By text segment 14 Emma is walking out on Charles following the botched 
surgery. Emma has total disdain for Charles in text segment 15. In text segment 16 
Emma is buying gifts for Rodolphe in spite of potentially making Charles indebted. 
In text segments 17 and 18, Charles’ mother is present, and her relationship with 
Emma as mother-in-law is difficult. Plans for running away ensue, with pangs of 
conscience for Emma, and in the final text segment Rodolphe inwardly refuses to 
leave with Emma. 

In Figure 4, we display the evolution of sentiment, expressed by (or proxied by) 
the terms “kiss”, “tenderness”, and “happiness”. We see that some text segments are 
more expressive of emotion than others. 

[Figure 1: Semantic distances between two terms and eleven scenes. Horizontal axis: 
scenes 22, 26, 28, 30, 31, 43, 58, 59, 70, 75, 77. Vertical axis: semantic space 
distance between scene and term.] 


Fig. 1 In the full dimensionality factor space, based on all interrelationships of scenes and words, 
we determined the distance between the word “darling” in this space, with each of the 11 scenes 
in this space. We did the same for the word “love”. The semantic locations of these two words, 
relative to the semantic locations of scenes 30 and 70 are highlighted with boxes. 


6 Analyses of Mapping of Behavioural or Activity Patterns or 
Trends 


This concerns semantic mapping of Twitter data relating to festivals of music, film, 
theatre and so on. 75 languages were found to be in use, including Japanese, Arabic 
and so on, with the majority in Roman script. The language label is only indicative, 
because the labelled language may be only partially used, or not in fact used, in a 
tweet. We take the following languages: English, Spanish, French, Japanese, 
Portuguese. We consider the days 2015-05-11 to 2016-08-02, with two days removed 
due to lack of tweets. The numbers of tweets for these languages were as follows 
(determined on 11 August 2016): en, 37681771; es, 9984507; fr, 4503113; 
ja, 2977159; pt, 3270839. 

The tweeters and the festivals are as follows. There were 4913781 tweets 
characterized as French. (For user, date and tweet content, the file size was 
667 MB.) The following were sought in the tweets: Cannes, cannes, CANNES, Avignon, 
avignon, AVIGNON. Upper and lower case were retained in order to verify the semantic 
proximity of these variants. These related to the Cannes Film Festival and the 
Avignon Theatre Festival. The following total numbers of occurrences of these words 
were found, together with the maximum number of occurrences by a user, i.e. by a 
tweeter: Cannes, 1230559 and 3388; cannes, 145939 and 4024; CANNES, 57763 and 829; 
Avignon, 272812 and 4238; avignon, 39323 and 2909; AVIGNON, 14647 and 900. 


[Figure 2: Sequence-constrained hierarchical clustering.] 

Fig. 2 Hierarchical clustering, that is sequence constrained, of the 11 scenes used, i.e. scenes 22, 
26, 28, 30, 31, 43, 58, 59, 70, 75, 77 (all with dialogue, and only dialogue, between Ilsa and Rick). 
Rather than projections on factors, here the correlations (or cosines of angles with factors) are used 
to directly capture orientation. 


[Figure 3: Emma's evolving state vis-a-vis Rodolphe (lines/circles) and Charles (full lines). 
Horizontal axis: text segments 1 to 22.] 

Fig. 3 The relationships of Emma to Rodolphe (lines/circles, black) and to Charles (full line, red) 
are mapped out. The text segments encapsulate narrative chronology, that maps approximately into 
a time axis. Low or small values can be viewed as emotional attachment. 


[Figure 4: Tracking of sentiments.] 

Fig. 4 A low value of the emotion, expressed by the words “kiss”, “happiness” and “tenderness”, 
implies small distance to the text segment. These curves, “kiss”, “happiness” and “tenderness” start 
on the upper left on the top, the middle, and the bottom, respectively. The chronology of sentiment 
tracks the closeness of these different sentimental terms relative to the narrative, represented by 
the text segment. Terms and text segments are vectors in the semantic, factorial space, and the full 
dimensionality of this space is used. 

The total number of tweeters, also called users here, is 880664; the total number 
of days retained, from 11 May 2015 to 11 Sept. 2016, is 481. Cross-tabulated are 
880664 users by 481 days. There are 1230559 retained and recorded tweets. The 
non-sparsity of this matrix is just 0.79%. 

In Figure 5, the following are mapped: C, c, CA (Cannes, cannes, CANNES) and A, a, 
AV (Avignon, avignon, AVIGNON). They are supplementary variables in the 
Correspondence Analysis principal factor plane. Semantically they are clustered. 
They are shown against the background of the Big Data, here the 880664 tweeters, 
represented by dots. 

Current considerations, relating to approximately 55 million tweets per year 
(from May 2015), are as follows. Determine some other, related or otherwise, 
behavioural patterns that are accessible in the latent semantic, factor space. 
Retain selected terms from the tweets, and, as supplementary elements, see how they 
provide more information on patterns and trends. Carry out year by year trend 
analysis. 

Fig. 5 880664 Twitter tweets projected on the principal factor, i.e. principal axis, plane. Attributes 
are projected. 


For further analyses and description of the data, see [4] and [17]. 


6.1 Baselining or Contextualizing Analysis 


This section concerns baselining, i.e. contextualizing, against healthy reference 
subjects, using a case study from [9]. It repeats some of the description in [13] 
of testing through statistical baselining or contextualizing in a multivariate 
manner. 

In [20], there is an important methodological development concerning statistical 
inference in Geometric Data Analysis, i.e. based on MCA, Multiple Correspondence 
Analysis. At issue is the statistical “typicality of a subcloud with respect to the 
overall cloud of individuals”. Following an excellent review of permutation tests, 
the data are introduced: 6 numerical variables relating to gait and body movement, 
recorded for a reference group of 45 healthy subjects and for a group of 15 
Parkinson's disease patients, each before and after drug treatment. [8] (section 
11.1) relates to this analysis of the, in total, 45 + 15 + 15 observation vectors, 
of subjects between the ages of 60 and 92, with an average age of 74. 

First, correlation analysis is carried out, and PCA of the standardized variables 
then shows that the first two axes explain 97% of the variance. Axis 1 is 
characterized as “performance”, and axis 2 is characterized as “style”. Then the two 
sets of 15 Parkinson's patients, before treatment and after treatment, are input 
into the analysis as supplementary individuals. [20] directly addresses, 
statistically, the question of the effect of treatment. Just as in [8], the healthy 
subjects are the main individuals, and the patients, before and after treatment, 
are the supplementary individuals. This allows discussion of the subclouds of the 
before-treatment and after-treatment individuals, relative to the first 
(performance) axis and the second (style) axis. The effect of medical treatment is 
assessed statistically through a permutation-based distributional evaluation of the 
following statistic. The subcloud’s deviations relative to samples of the reference 
cloud are at issue. The Mahalanobis distance based on the covariance structure of 
the reference cloud is used. The test statistic is the Mahalanobis norm of 
deviations between subcloud points and the mean point of the reference cloud. 
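
The idea of such a typicality test can be sketched in two dimensions as follows. This is a simplified illustration, not the procedure of [20]: the statistic here is the squared Mahalanobis norm of the deviation of the subcloud mean from the reference mean, its reference distribution is approximated by randomly drawn subsets of the reference cloud rather than by full enumeration, and all data are simulated. Function names are hypothetical.

```python
import random


def mean2(points):
    """Mean point of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def covariance2(points):
    """2x2 covariance matrix (divisor n), returned as (sxx, sxy, syy)."""
    mx, my = mean2(points)
    n = len(points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy


def mahalanobis_sq(point, centre, cov):
    """Squared Mahalanobis norm of (point - centre) for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def typicality_p_value(reference, subcloud, n_draws=2000, seed=0):
    """Share of random reference subsets at least as extreme as the subcloud."""
    rng = random.Random(seed)
    centre, cov = mean2(reference), covariance2(reference)
    observed = mahalanobis_sq(mean2(subcloud), centre, cov)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(reference, len(subcloud))
        if mahalanobis_sq(mean2(sample), centre, cov) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_draws + 1)


# Simulated data: 45 reference subjects and a shifted subcloud of 15 patients.
rng = random.Random(1)
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(45)]
patients = [(rng.gauss(-3, 1), rng.gauss(0.5, 1)) for _ in range(15)]
observed, p_value = typicality_p_value(reference, patients)
print(round(observed, 2), round(p_value, 4))
```

A small p-value indicates that the subcloud is atypical of the reference cloud, which is the sense in which the reference subjects serve as a baseline.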

In summary, this exemplifies, in a most important way, how supplementary elements 
and principal elements are selected and used in practice. The medical treatment 
context makes very clear the role of such baselining, i.e. contextualizing, against 
healthy reference subjects. 


7 Conclusion 


Much that is at issue here is close to what is under discussion in [3]. The integral 
association of methodology and application domain will, of course, have shared 
and common methodological perspectives. However, the application of statistical 
models and other analytical stages, such as feature selection, data aggregation 
with its various implications, and what is often termed data cleaning or data 
cleansing, all require analytical focus, and account to be taken of the analytical 
context. The latter may well include baselining, or benchmarking in an operational 
manner. In a sense, we might state that combinatorial inference is so paramount 
because of its applicability. 

A good deal of the case studies reported on here made use of preliminary 
functionality, to be part of the R package, Xplortext. This package makes use of 
the R packages tm and FactoMineR, and adds greatly to their functionality. 

The software system, SPAD, is also extending greatly into support for text 
processing. 


Workers’ skills and wage inequality: 
A time-space comparison across European Mediterranean countries 

Competenze e disuguaglianze salariali: un’analisi spazio-temporale 
per i paesi dell’Europa mediterranea 


Gaetano Musella and Gennaro Punzo 


Abstract The work explores how changes in the demand for skills in the labour 
market affect wage inequality in four countries of Southern Europe (Greece, Italy, 
Portugal, and Spain), in a comparative perspective. Through the Recentered 
Influence Function (RIF) regression of the Gini index on EU-SILC data, Italy is 
compared with each of the other countries in order to assess the evolution of 
spatial inequality divides during the Great Recession (2005-2013). Gini gaps are 
then decomposed into the composition effect (employees’ endowments) and the wage 
structure (how employees’ skills are rewarded). Based on our results, Italy appears 
to be a less unequal country within Mediterranean Europe. A clearer employment 
structure may slow a country’s inequality growth and reduce spatial gaps. 


Abstract (translated from the Italian) The work analyses how recent changes in the 
demand for job skills affect wage inequality in the main countries of Mediterranean 
Europe (Greece, Italy, Portugal and Spain). Using EU-SILC data, the work estimates 
RIF regression models on the Gini index in order to assess the space-time dynamics 
of wage inequality in the years of the Great Recession. Spatial inequality gaps are 
decomposed with the twofold aim of separating the share of the total gap 
attributable to the territorial distribution of workers’ endowments from the share 
linked to the differing capacity of national labour markets to turn those 
endowments into job opportunities and earnings. The results give a picture of Italy 
as “less unequal” than the other Mediterranean countries and show how well-defined 
market structures can slow inequality and reduce spatial gaps. 


1. Background and aim of the work 


The Great Recession is the biggest macroeconomic downturn since the 1930s. Its 
origins can be traced back to the 2007 global financial crisis that, in turn, was 
generated by the housing bubble and banking crisis in the United States. The 
economic recession reached its peak in Europe before the end of 2009, with an 
inevitable rise in unemployment and income inequality. Although the crisis hit all 
EU Member States hard, each country was affected to a different extent and with 
different timing and strength. Nations with weaker economies suffered the most and, 
in particular, the four European Mediterranean countries (Portugal, Italy, Greece, 
and Spain) were damaged more extensively [7]. For instance, their total and youth 
unemployment rates were significantly greater than the Eurozone average (Eurostat 
on-line database). 

The potential causes of rising unemployment and of changes in the employment 
composition are widely discussed in the literature and, for some years, many 
authors have focused their attention on the shrinking of middle-skill jobs 
[1,2,3,5]. The progressive decrease in the demand for middle-skilled workers has 
generated different labour market structures in most developed economies, such as 
job polarisation, upgrading and downgrading of occupations. Specifically, job 
polarisation represents the case in which the demand for jobs requiring high and 
low skills grows simultaneously. Upgrading of occupations occurs if the growth 
involves high-skill jobs exclusively, while downgrading is the case in which 
low-skill occupations grow faster than the others. 

In this field, the goal of the present work is to explore the ways in which the 
decline of middle-skill jobs affects wage inequality in four countries of Southern 
Europe, in a comparative perspective. For this purpose, the Recentered Influence 
Function (RIF) regression [4] allows two objectives to be achieved. The first is to 
evaluate the direction and intensity of the spatial inequality gap between Italy 
and each of the other countries concerned (Greece, Portugal, and Spain). The second 
is to decompose the spatial inequality gaps into two components: the composition 
effect, which quantifies how much of the gap is due to the employees’ endowments, 
and the wage structure, which measures how much of the same gap is attributable to 
the capability of the country’s labour market to reward the employees’ 
characteristics. 


2. Sketching the labour market structures of Southern Europe 


This Section focuses on the changes in employment composition by skill level that 
occurred between 2005 and 2013 in Greece, Italy, Portugal, and Spain. To do this, 
three groups of high-, middle- and low-skilled workers have been created using the 
average level of formal education as a proxy for workers’ skills [3]. Data are from 
the EU-SILC (European Union Statistics on Income and Living Conditions) survey, 
which categorizes jobs according to the International Standard Classification of 
Occupations (ISCO-08). The analysis focuses on employees, aged 16-64, defined as 
anyone who works for a public or private employer and receives compensation in the 
form of a wage, salary, fee, payment by result or in kind. 

Concisely, our results show that the shrinking of middle-skill jobs is a common 
element within the Mediterranean countries. More precisely, it is worth noting how 
the drop of middle-skill jobs may generate different patterns of the employment 
structure in each country’s labour market (Figure 1). 


Figure 1: Employment shares by skill levels. Percentage changes 2005-2013. 


The relative growth of both low-skill (+40%) and high-skill jobs (+57%) allows the 
Greek labour market to be classified as polarised in 2005-2013. In Portugal, 
instead, the sharp rise in the demand for high-skilled employees (+125%), combined 
with the fall in the demand for low-skilled employees (-28%), provides evidence of 
upgrading of occupations. Italy and Spain share more hybrid patterns because 
neither of the two phenomena of job polarisation or upgrading clearly prevails. The 
drop in job opportunities for each differently skilled group of employees makes it 
difficult to draw clear conclusions about the structure of the Italian and Spanish 
labour markets. 
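
The classification logic used in this Section can be summarised as a small rule-based sketch. The rules are a simplification of the definitions given in Section 1, the function name is hypothetical, and the middle-skill changes passed in below are placeholders, since the text reports only that middle-skill jobs shrank.

```python
def classify_structure(low: float, middle: float, high: float) -> str:
    """Label a labour market from percentage changes in employment by skill level."""
    if low > 0 and high > 0 and middle < min(low, high):
        return "job polarisation"   # both ends grow, the middle lags behind
    if high > 0 and low <= 0 and middle <= 0:
        return "upgrading"          # only high-skill jobs grow
    if low > 0 and high <= 0 and middle <= 0:
        return "downgrading"        # only low-skill jobs grow
    return "hybrid"                 # no pattern clearly prevails


# Low- and high-skill changes are those quoted in the text; the middle-skill
# value of -10.0 is a hypothetical placeholder.
greece = classify_structure(low=40.0, middle=-10.0, high=57.0)
portugal = classify_structure(low=-28.0, middle=-10.0, high=125.0)
print(greece, portugal)
```

A market in which every skill group loses jobs, as described for Italy and Spain, falls into the "hybrid" branch.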


3. Methodology: RIF of Gini on log-wage 


The Recentered Influence Function regression [4] of Gini on log-wage is performed to 
evaluate Gini differentials between Italy and each of the other countries 
concerned. Once the RIF regression has been estimated for each country, the twofold 
decomposition (composition effect and wage structure) is carried out. 

The observed wage ($Y_i$) can be written, without imposing a specific functional 
form, as a wage determination function of some observed components $X_i$ and some 
unobserved components $\varepsilon_i$: 

$$Y_{gi} = f_g(X_i, \varepsilon_i), \quad \text{for } g = A, B \qquad (1)$$

where g = A for employees belonging to the country of reference (in this work, 
Italy) and g = B for employees from each other country (alternatively, Greece, 
Portugal, and Spain). 

The RIF-regression replaces the log-wage as dependent variable with the recentered 
influence function of the Gini coefficient $\nu(F)$. Let $\nu$ be the generic 
distributional statistic under study and IF the influence function [6]; the 
recentered influence function is: 

$$RIF(Y; \nu) = IF(Y; \nu) + \nu \qquad (2)$$

Its conditional expectation can be written as: 

$$E[RIF(Y; \nu) \mid X] = X\beta^{\nu} \qquad (3)$$

where $\beta^{\nu}$ represents the marginal effect of X on $\nu$. The $\nu$-overall 
wage inequality gap between the countries A and B can be measured as follows: 

$$\Delta_O^{\nu} = \nu_B(F_B) - \nu_A(F_A) = \nu_B - \nu_A \qquad (4)$$

that can be decomposed into: 

$$\Delta_O^{\nu} = (\nu_B - \nu_C) + (\nu_C - \nu_A) = \Delta_S^{\nu} + \Delta_X^{\nu} \qquad (5)$$


The overall gap between Italy and each other country is decomposed into wage 
structure ($\Delta_S^{\nu}$) and composition effect ($\Delta_X^{\nu}$). The first 
term corresponds to the effect on $\nu$ of a change from $f_B(\cdot,\cdot)$ to 
$f_A(\cdot,\cdot)$, keeping the distribution of $(X,\varepsilon) \mid G = B$ 
constant. The second term keeps the wage structure $f_A(\cdot,\cdot)$ constant and 
measures the effect of changes from $(X,\varepsilon) \mid G = B$ to 
$(X,\varepsilon) \mid G = A$. The key term for decomposing the total gap is the 
counterfactual distributional statistic $\nu_C$, which represents the 
distributional statistic that would have prevailed if employees from country B had 
had the wage structure of those from country A. 
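
The additive form of the decomposition can be illustrated with simple arithmetic. The sketch below uses the rounded Portugal-Italy figures for 2005 reported in Table 1; shares computed from rounded figures may differ slightly from the published ones, which are presumably based on unrounded estimates. The function name is hypothetical.

```python
def decomposition_shares(composition: float, wage_structure: float):
    """Return the total gap and the percentage share of each component."""
    total = composition + wage_structure
    return total, 100.0 * composition / total, 100.0 * wage_structure / total


# Portugal-Italy, 2005 (rounded values from Table 1).
total, comp_share, wage_share = decomposition_shares(-0.0040, 0.0142)
print(round(total, 4), round(comp_share, 2), round(wage_share, 2))
```

Because the composition effect is negative here, its share is negative and the wage structure share exceeds 100%; the two shares still sum to 100.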

As regards the Gini coefficient, the distributional statistic $\nu$ is defined as 
follows: 

$$\nu^{GC}(F_Y) = 1 - 2\mu^{-1} R(F_Y) \qquad (6)$$

where $R(F_Y) = \int_0^1 GL(p; F_Y)\,dp$ with $p(y) = F_Y(y)$, and the generalised 
Lorenz ordinate of $F_Y$ is given by 
$GL(p; F_Y) = \int_{-\infty}^{F_Y^{-1}(p)} z\,dF_Y(z)$. As demonstrated by Firpo 
et al. [4], the recentered influence function of Gini can be rewritten as: 

$$RIF(y; \nu^{GC}) = 1 + 2\mu^{-2} R(F_Y)\,y - 2\mu^{-1}\left[y\,(1 - p(y)) + GL(p(y); F_Y)\right] \qquad (7)$$


The gaps in the Gini index between the two countries may be decomposed as in 
equation (5). 
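
A small numerical sketch of equations (6) and (7) follows. It assumes the form of the recentered influence function given by Firpo et al. [4], in which the term $2\mu^{-2}R(F_Y)$ multiplies y, and it uses empirical ranks and a trapezoidal integral of the generalised Lorenz curve. The sample is a toy example, not EU-SILC data, and the function name is hypothetical.

```python
def gini_and_rif(wages):
    """Return (gini, rif_values), with rif_values in sorted-wage order."""
    y = sorted(wages)
    n = len(y)
    mu = sum(y) / n

    # Generalised Lorenz ordinates GL(p_i) at p_i = i/n, i = 1..n.
    gl, running = [], 0.0
    for value in y:
        running += value / n
        gl.append(running)

    # R(F_Y): trapezoidal integral of GL over [0, 1], with GL(0) = 0.
    r = (sum(gl) - 0.5 * gl[-1]) / n
    gini = 1.0 - 2.0 * r / mu

    rif = []
    for i, value in enumerate(y, start=1):
        p = i / n
        rif.append(1.0 + 2.0 * r * value / mu ** 2
                   - (2.0 / mu) * (value * (1.0 - p) + gl[i - 1]))
    return gini, rif


wages = [1.0, 3.0]                      # toy sample
gini, rif = gini_and_rif(wages)
print(gini, sum(rif) / len(rif))
```

With this discretisation the sample mean of the RIF values reproduces the Gini coefficient, which is the property that lets the RIF stand in for the statistic as the dependent variable of a regression.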


4. Main results 


This Section shows the main results of RIF-decomposition that allow evaluating the 
wage inequality gaps of Italy with each other country in both 2005 and 2013 and 
their evolution over time (Tables 1-2). 


Table 1: RIF decomposition of Gini on log-wage. Gap between Italy and each other country, 2005 

                      Spain-Italy  % share   Portugal-Italy  % share   Greece-Italy  % share 
Total gap             0.0080***    -         0.0102***       -         0.0044***     - 
                      (0.0004)               (0.0006)                  (0.0006) 
Composition effect    0.0030**     38.24     -0.0040***      -39.79    0.0004        10.79 
                      (0.0003)               (0.0005)                  (0.0006) 
Wage structure        0.0049***    61.76     0.0142***       139.79    0.0039***     89.21 
                      (0.0004)               (0.0006)                  (0.0008) 

*Significant at 10%; **Significant at 5%; ***Significant at 1%. Standard errors in brackets. 


Table 2: RIF decomposition of Gini on log-wage. Gap between Italy and each other country, 2013 

                      Spain-Italy  % share   Portugal-Italy  % share   Greece-Italy  % share 
Total gap             0.0116***    -         0.0037***       -         0.0016**      - 
                      (0.0006)               (0.0006)                  (0.0008) 
Composition effect    0.0047***    40.12     -0.0034***      -94.04    0.0021***     131.61 
                      (0.0005)               (0.0003)                  (0.0007) 
Wage structure        0.0069***    59.87     0.0071***       194.04    -0.0005       -31.61 
                      (0.0007)               (0.0007)                  (0.0009) 

*Significant at 10%; **Significant at 5%; ***Significant at 1%. Standard errors in brackets. 


Between 2005 and 2013, the overall Gini increased in Spain, Italy and Greece, while 
it slightly decreased in Portugal, changing the intensity of inequality gaps over 
time. In fact, while Italy shows the lowest level of wage inequality in both 2005 
and 2013, the spatial gap has progressively narrowed over time for Portugal (from 
0.0102 to 0.0037) and Greece (from 0.0044 to 0.0016), and it has increased for 
Spain (from 0.0088 in 2005 to 0.0116 in 2013). In detail, the decrease of wage 
inequality in Portugal reduces the spatial differential with Italy in 2013 and, 
similarly, the spatial gap between Italy and Greece has narrowed because inequality 
increased in Greece to a lesser extent than in Italy. Instead, the sharpest 
increase, in Spanish wage inequality, makes the spatial differential with Italy 
even wider in 2013. 

For Spain and Portugal, a great deal of the spatial inequality gap is attributable 
to the wage structure (2005: 81.01% and 139.22%, respectively; 2013: 62.61% and 
202.6%, respectively), highlighting how their larger inequality depends on the 
lower capacity of their labour markets to reward the employees’ characteristics. 
Some distinctions are appropriate regarding the composition effect in the two 
countries. In Spain, the composition effect plays a significant role (around 40%) 
in widening the inequality differential, showing that, beyond the changes in the 
demand for skills, the divide is also potentially due to the different employees’ 
endowments of the two countries. By contrast, in Portugal, the composition effect 
is negative. This means that the contribution of the wage structure to widening the 
inequality gap between Italy and Portugal is partly compensated by the larger 
availability, in the Portuguese workforce, of high-skilled employees (as evidenced 
by the upgrading structure of its labour market), who usually earn higher salaries. 
The composition effect plays a leading role in explaining the gap between Italy and 
Greece in 2013 (188.2%). As regards the wage structure in the comparison with 
Greece, its evolution over time is worthy of attention: while in 2005 it was the 
primary component of the total spatial gap, in 2013 it is no longer significant and 
the composition effect remains solely responsible for the higher wage inequality in 
Greece. 

Briefly, the spatial comparison highlights how inequality gaps may also be 
explained by the differences in labour market structures and in their ability to 
reward the investment in skills. The evolution of inequality divides over time 
shows how the well-defined structures of upgrading (Portugal) and job polarisation 
(Greece) seem to have an equalising effect on the countries’ wage distribution, 
lessening both the overall inequality within countries and the penalties with 
respect to Italy. Instead, a more hybrid structure (Spain) exacerbates wage 
inequality, with both the wage structure and the composition effect playing a key 
role in explaining the spatial gap with Italy. 


Exploratory factor analysis of ordinal variables: 
a copula approach 


Analisi fattoriale esplorativa di variabili ordinali: un 
approccio via copula 


Marta Nai Ruscone 


Abstract Exploratory factor analysis attempts to identify the underlying factors that 
explain the pattern of correlations within a set of observed variables. The analysis 
is almost always performed with Pearson’s correlations even when the data are 
ordinal, but this is not appropriate since such data are not quantitative. The use of 
Likert scales is increasingly common in the field of social research, so it is 
necessary to determine which methodology is the most suitable for analysing data 
obtained as non-quantitative measures. In this context, also by means of simulation 
studies, we aim to illustrate the advantages of using Spearman’s grade correlation 
coefficient on a transformation operated by the copula function in order to perform 
exploratory factor analysis of ordinal variables. Moreover, by using the copula, we 
consider the general dependence structure, providing a more robust reproduction of 
the measurement model. 

Abstract (translated from the Italian) Exploratory factor analysis seeks to identify 
the latent factors that explain a set of observed variables. The analysis almost 
always uses Pearson’s correlation, even when the data are ordinal, but this is not 
appropriate because such data are not quantitative. The use of Likert scales is 
increasingly common in the field of social research, so it is necessary to determine 
which method is most suitable for analysing such data, bearing in mind that they are 
often analysed with techniques suited only to quantitative measures. In this context, 
and by means of simulation studies, we illustrate the advantages of using the 
Spearman grade correlation obtained through the copula function rather than 
Pearson’s correlation. Using the copula thus takes the general dependence structure 
into account, providing a more accurate measurement. 


Key words: Factor analysis, copula, ordinal variables, Likert scales, correlation 


Marta Nai Ruscone 
School of Economics and Management - LIUC - University Cattaneo, C.so Matteotti 22 - 21053 
Castellanza (VA), Italy, e-mail: mnairuscone@liuc.it 




Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 




1 Introduction 


Exploratory factor analysis is a widely used statistical technique in the social 
sciences, where the main interest lies in measuring unobserved constructs such as 
emotions, attitudes, beliefs and behaviors. The main idea behind the analysis is that 
the latent variables (also named factors) account for the dependencies among the 
observed variables (also named items or indicators), in the sense that if the factors 
are held fixed, the observed variables would be independent. In exploratory factor 
analysis the goal is the following: for a given set of observed variables 
$x_1, \ldots, x_p$, one wants to find a set of latent factors $\xi_1, \ldots, \xi_k$, 
fewer in number than the observed variables ($k < p$), that contain essentially the 
same information. In its classical formulation [1], it concerns a set of continuous 
variables measured on a set of independent units. The data usually encountered in 
the social sciences are, however, of a categorical nature (ordinal or nominal). The 
Likert Rating Scale [10], [11] is a simple procedure for generating measurement 
instruments which is widely used by social scientists to measure a variety of latent 
constructs, and meticulous statistical procedures have therefore been developed to 
design and validate these scales [3], [15]. However, most of these ignore the ordinal 
nature of the observed responses and assume the presence of continuous observed 
variables measured at interval level. Evidence shows that, under relatively common 
circumstances, classical factor analysis (FA) yields inaccurate results when 
characterizing the internal structure of the scale or selecting the most informative 
items within each factor [4], [7]. 

In the present work, Spearman’s grade correlation coefficient on a transformation 
operated by the copula function is employed, in order to take into account the 
ordinal nature of the data. The copula is a helpful tool for handling multivariate 
continuous distributions with given univariate marginals [14]. It describes the 
dependence structure existing across pairwise marginal random variables. In this way 
we can consider bivariate distributions with dependence structures different from 
the linear one that characterises the multivariate normal distribution. 

Since the use of measurement instruments which require categorical responses from 
subjects is increasingly common in social research, and this implies the use of 
ordinal scales, the present work aims to point out a correct definition of a 
dependence measure for ordinal variables, as an alternative to the Pearson 
correlation coefficient, which is correctly applied only to quantitative variables. 
Moreover, the use of several copulae with specific tail dependence allows us to 
obtain an index that weights the categories of the ordinal variables in several 
ways. In so doing we can address and recognize the ordinal nature of the observed 
variables and estimate that weight directly from the data. 


2 The copula function 


The copula function is the key ingredient for handling multivariate continuous 
distributions with given univariate marginals. We discuss this issue briefly below; 
for further details and proofs, see for instance [14], [8] and [2]. It describes the 
dependence structure existing across pairwise marginal random variables. In this way 
we can consider bivariate distributions with dependence structures different from 
the linear one that characterises the multivariate normal distribution. 

A bivariate copula $C : I^2 \to I$, with $I^2 = [0,1] \times [0,1]$ and $I = [0,1]$, 
is the cumulative bivariate distribution function of a random variable $(U_1, U_2)$ 
with uniform marginal random variables in [0,1]: 

$$C(u_1, u_2; \theta) = P(U_1 \le u_1, U_2 \le u_2; \theta), \quad 0 \le u_1 \le 1, \; 0 \le u_2 \le 1 \qquad (1)$$

where $\theta$ is a parameter measuring the dependence between $U_1$ and $U_2$. 

The following theorem by Sklar [14] explains the use of the copula in the
characterization of a joint distribution. Let $(X_1,X_2)$ be a bivariate random
variable with marginal cdfs $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ and joint cdf
$F_{X_1,X_2}(x_1,x_2;\theta)$; then there always exists a copula function
$C(\cdot,\cdot;\theta)$ with $C : I^2 \to I$ such that


$$F_{X_1,X_2}(x_1,x_2;\theta) = C(F_{X_1}(x_1), F_{X_2}(x_2); \theta), \quad x_1,x_2 \in \mathbb{R}. \tag{2}$$


Conversely, if $C(\cdot,\cdot;\theta)$ is a copula function and $F_{X_1}(x_1)$ and
$F_{X_2}(x_2)$ are marginal cdfs, then $F_{X_1,X_2}(x_1,x_2;\theta)$ is a joint cdf.
If $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ are continuous functions, then the copula
$C(\cdot,\cdot;\theta)$ is unique, and it can be found by inverting (2):

$$C(u_1,u_2) = F_{X_1,X_2}(F_{X_1}^{-1}(u_1), F_{X_2}^{-1}(u_2)), \tag{3}$$


with $u_1 = F_{X_1}(x_1)$ and $u_2 = F_{X_2}(x_2)$. The theorem states that each joint
distribution can be expressed in terms of two separate but related components: the
marginal distributions and the dependence structure between them. The dependence
structure is captured by the copula function $C(\cdot,\cdot;\theta)$. Moreover, (2)
provides a general mechanism for constructing new multivariate models in a
straightforward manner. By changing the copula function we can construct new
bivariate distributions with different dependence structures, also different from
the linear one that characterizes the multivariate normal distribution, with the
association parameter indicating the strength of the dependence.
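As a minimal sketch of the construction in (2), the Python snippet below combines two
marginal cdfs with a copula to obtain a joint cdf. The Clayton copula, the exponential
marginals, the rate values and all function names are illustrative assumptions; none
of them is taken from the text.

```python
import math

def clayton_copula(u, v, theta):
    """Clayton copula C(u, v; theta), theta > 0 (an illustrative family)."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def exponential_cdf(x, rate):
    """Marginal cdf of an exponential distribution (assumed for illustration)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def joint_cdf(x1, x2, theta, rate1=1.0, rate2=2.0):
    """Joint cdf built as in (2): copula evaluated at the two marginal cdfs."""
    u1 = exponential_cdf(x1, rate1)
    u2 = exponential_cdf(x2, rate2)
    return clayton_copula(u1, u2, theta)

# Letting x2 grow large recovers the first marginal, as a joint cdf must.
marginal_check = joint_cdf(1.0, 50.0, theta=2.0)
```

Changing `clayton_copula` for another copula family changes the dependence structure
while leaving both marginals untouched.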

Each copula is related to the most important measures of dependence: the Pearson
correlation coefficient, the Spearman grade correlation coefficient and the tail
dependence parameters. The Spearman grade correlation coefficient (see [14] pp.
169-170 for its definition for continuous random variables) measures the association
between two variables and can be expressed as a function of the copula. More
precisely, if two random variables are continuous and have copula $C$ with parameter
$\theta$, then the Spearman grade correlation is


$$\rho_S(C) = 12 \iint_{I^2} C_\theta(u_1,u_2)\,du_1\,du_2 - 3 = \frac{\mathrm{Cov}(U_1,U_2)}{\sqrt{\mathrm{Var}(U_1)}\,\sqrt{\mathrm{Var}(U_2)}} \tag{4}$$


For continuous random variables this measure is invariant with respect to the two
marginal distributions, i.e. it can be expressed as a function of the copula alone.
This property is also known as 'scale invariance'. Note that not all measures of
association satisfy this property, e.g. Pearson's linear correlation coefficient [6].
Among all copulas $C : I^2 \to I$, three are especially noteworthy: for every
$u,v \in I$, $W(u,v) = \max(u+v-1,0)$, $\Pi(u,v) = uv$, and $M(u,v) = \min(u,v)$.
These copulae correspond to perfect negative association ($\rho_S(C) = -1$),
independence ($\rho_S(C) = 0$), and perfect positive association ($\rho_S(C) = +1$)
between the two random variables, respectively. For all $(u,v) \in I^2$ it holds that
$W(u,v) \le \Pi(u,v) \le M(u,v)$.
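The link between (4) and the three reference copulas can be checked numerically. The
sketch below approximates the double integral in (4) with a midpoint rule; the grid
size and the function names are assumptions made for illustration.

```python
def spearman_rho(copula, n=200):
    """Approximate 12 * integral of C over the unit square - 3 by a midpoint rule."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        for j in range(n):
            v = (j + 0.5) * h
            total += copula(u, v)
    return 12.0 * total * h * h - 3.0

def copula_w(u, v):
    """Lower bound W(u, v) = max(u + v - 1, 0)."""
    return max(u + v - 1.0, 0.0)

def copula_pi(u, v):
    """Independence copula Pi(u, v) = u * v."""
    return u * v

def copula_m(u, v):
    """Upper bound M(u, v) = min(u, v)."""
    return min(u, v)

rho_w = spearman_rho(copula_w)    # close to -1
rho_pi = spearman_rho(copula_pi)  # close to 0
rho_m = spearman_rho(copula_m)    # close to +1
```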

The tail dependence relationship can be measured by means of the upper and lower
tail dependence parameters


$$\lambda_U = \lim_{u \to 1^-} P[X_1 > F_{X_1}^{-1}(u) \mid X_2 > F_{X_2}^{-1}(u)] = \lim_{u \to 1^-} \frac{1 - 2u + C(u,u)}{1-u} \tag{5}$$

$$\lambda_L = \lim_{u \to 0^+} P[X_1 \le F_{X_1}^{-1}(u) \mid X_2 \le F_{X_2}^{-1}(u)] = \lim_{u \to 0^+} \frac{C(u,u)}{u} \tag{6}$$


If $\lambda_U \in (0,1]$ or $\lambda_L \in (0,1]$, the random variables $X_1$ and $X_2$
present upper or lower tail dependence, respectively. If $\lambda_U = 0$ or
$\lambda_L = 0$, there is no upper or lower tail dependence. These parameters measure
the dependence in the tails of the joint distribution, i.e. the extent to which
high/low values of one variable are associated with high/low values of the other.
They represent the probability that one variable is extreme given that the other
is extreme. The Spearman grade correlation coefficient and both tail dependence
parameters are directly associated with the parameters of some copula families [14].
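To make the two limits concrete, the sketch below evaluates the ratios in (5) and (6)
close to the boundary for a Clayton copula, for which the lower tail dependence is
known to be $2^{-1/\theta}$ and the upper tail dependence is zero. The choice of the
Clayton family, the value of theta and the step sizes are illustrative assumptions.

```python
def clayton_diagonal(u, theta):
    """Clayton copula on the diagonal, C(u, u; theta), for 0 < u <= 1."""
    return (2.0 * u ** (-theta) - 1.0) ** (-1.0 / theta)

def lower_tail(theta, u=1e-6):
    """Ratio C(u, u) / u evaluated near 0, approximating the lower-tail limit."""
    return clayton_diagonal(u, theta) / u

def upper_tail(theta, eps=1e-5):
    """Ratio (1 - 2u + C(u, u)) / (1 - u) evaluated near 1."""
    u = 1.0 - eps
    return (1.0 - 2.0 * u + clayton_diagonal(u, theta)) / (1.0 - u)

theta = 2.0
lam_lower = lower_tail(theta)  # approaches 2 ** (-1 / theta)
lam_upper = upper_tail(theta)  # approaches 0
```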


3 Our proposal 


Theory and methodology for exploratory factor analysis have been well developed 
for continuous variables, but in practice observed or measured variables are often 
ordinal. 

Observations on an ordinal variable are assumed to fall into logically ordered
categories. This ordering is typical when data are collected from questionnaires.
A good example is the Likert scale that is frequently used in survey research: 1 =
Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, and 5 = Strongly agree.
Although a question is designed to measure a theoretical concept, the observed
responses are only a discrete realization over a small number of categories, and the
distances between categories are unknown. Following [13], [9] and others, it is
assumed that there is a continuous variable $x_i^*$ underlying the ordinal variable
$x_i$, $i = 1,\dots,p$. This continuous variable $x_i^*$ represents the attitude
underlying the ordered responses to $x_i$ and is assumed to range from $-\infty$ to
$+\infty$.

The underlying variable $x_i^*$ is unobservable; only the ordinal variable $x_i$ is
observed. For an ordinal variable $x_i$ with $m_i$ categories, the connection between
the ordinal variable $x_i$ and the underlying variable $x_i^*$ is:

$$x_i = c \iff \tau^{(i)}_{c-1} < x_i^* < \tau^{(i)}_{c}, \quad c = 1,2,\dots,m_i \tag{7}$$


where

$$-\infty = \tau^{(i)}_0 < \tau^{(i)}_1 < \dots < \tau^{(i)}_{m_i-1} < \tau^{(i)}_{m_i} = +\infty \tag{8}$$


are threshold parameters. For variable $x_i$ with $m_i$ categories, there are
$m_i - 1$ strictly increasing threshold parameters
$\tau^{(i)}_1, \tau^{(i)}_2, \dots, \tau^{(i)}_{m_i-1}$.
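The threshold mechanism in (7) and (8) can be sketched in a few lines of Python. The
threshold values below are hypothetical and chosen only to produce a five-point
Likert-type item; a latent value falling exactly on a threshold is assigned to the
lower category here, which is a convention of this sketch.

```python
import bisect

# Hypothetical interior thresholds tau_1 < ... < tau_{m-1} for m = 5 categories.
thresholds = [-1.5, -0.5, 0.5, 1.5]

def to_category(latent_value, thresholds):
    """Return the ordinal category c such that tau_{c-1} < x* <= tau_c."""
    return bisect.bisect_left(thresholds, latent_value) + 1

latent_values = [-2.0, -0.3, 0.1, 0.9, 2.4]
categories = [to_category(x, thresholds) for x in latent_values]
```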

Let $x_i$ and $x_j$ be two ordinal variables with $m_i$ and $m_j$ categories
respectively. We now define Spearman's grade correlation via copula. We consider a
copula $C_\theta$ associated with each pair $(X_i^*, X_j^*)$ underlying the pair
$(X_i, X_j)$ in the set of ordinal items $X_1, X_2, \dots, X_p$. We thus assume that
each pair $(X_i, X_j)$ corresponds to a bivariate discrete random variable obtained
by discretising a bivariate continuous latent variable $U_i = F(X_i^*)$,
$U_j = F(X_j^*)$ with support on the unit square.

Let $A_{rs} = [u_{r-1}, u_r] \times [v_{s-1}, v_s]$, $r = 1,2,\dots,m_i$,
$s = 1,2,\dots,m_j$, be the rectangles defining the discretisation. Let
$p_{11}, \dots, p_{m_i m_j}$ be the joint probabilities of the ordinal variables
corresponding to the rectangles $A_{11}, \dots, A_{m_i m_j}$, and let
$V_{C_\theta}(A_{11}), \dots, V_{C_\theta}(A_{m_i m_j})$ be the volumes of the
rectangles under the copula $C_\theta$; then


$$V_{C_\theta}(A_{11}), \dots, V_{C_\theta}(A_{m_i m_j}) = p_{11}, \dots, p_{m_i m_j} \tag{9}$$


There exists a unique element in the family of copulas for which (9) holds true. We
apply this to each pair $(X_i, X_j)$, $i \ne j$, in the set of items. The parameter
$\theta$ can be estimated via maximum likelihood [5] [12]. The multivariate normality
assumption on the underlying variables, required by the polychoric correlation and
by the Pearson correlation, is relaxed. To apply the index one needs only to specify
the dependence structure of the variables by means of a copula family.
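The following sketch illustrates how rectangle volumes under a copula can be compared
with an observed contingency table and how theta might be estimated by maximum
likelihood. It makes several assumptions that are not in the text: the Clayton family,
cut points taken from the observed marginal proportions, a plain grid search in place
of a numerical optimiser, and a hypothetical 3 x 3 table of counts.

```python
import math

def clayton(u, v, theta):
    """Clayton copula (illustrative family), with C = 0 on the lower boundary."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def cut_points(marginal_counts):
    """Cumulative marginal proportions 0 = u_0 < u_1 < ... < u_m = 1."""
    total = float(sum(marginal_counts))
    cuts, running = [0.0], 0
    for count in marginal_counts:
        running += count
        cuts.append(running / total)
    return cuts

def rectangle_volumes(table, theta):
    """Copula volume of each rectangle [u_{r-1}, u_r] x [v_{s-1}, v_s]."""
    rows = cut_points([sum(row) for row in table])
    cols = cut_points([sum(col) for col in zip(*table)])
    return [[clayton(rows[r], cols[s], theta) - clayton(rows[r - 1], cols[s], theta)
             - clayton(rows[r], cols[s - 1], theta) + clayton(rows[r - 1], cols[s - 1], theta)
             for s in range(1, len(cols))]
            for r in range(1, len(rows))]

def log_likelihood(table, theta):
    """Multinomial log-likelihood of the counts given the rectangle volumes."""
    volumes = rectangle_volumes(table, theta)
    return sum(n * math.log(max(vol, 1e-300))
               for counts, vols in zip(table, volumes)
               for n, vol in zip(counts, vols))

# Hypothetical counts for two three-category items with positive association.
table = [[30, 15, 5], [15, 30, 15], [5, 15, 30]]
grid = [0.05 * k for k in range(1, 201)]
theta_hat = max(grid, key=lambda theta: log_likelihood(table, theta))
fitted = rectangle_volumes(table, theta_hat)
```

With a one-parameter family the fitted volumes reproduce the marginals exactly and the
joint probabilities only approximately, so this is a likelihood-based fit rather than
an exact solution of (9).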

In this way construct validity is analysed, for ordinal data obtained from Likert
scales, using the most suitable method. The factor results show a better fit to the
theoretical model when the factorization is carried out using Spearman's grade
correlation via copula rather than the Pearson correlation. Our focus here has been
to identify the type of correlation that yields a factor solution more in keeping
with the original measurement model, as we believe this to be of great importance
for drawing correct substantive conclusions. When we conduct a factor analysis (FA),
our results can be summarized as follows:


• Regardless of the number of dimensions and of items with skewness, Pearson
  correlations are lower than Spearman's grade correlations. The results are more
  significant when all items are asymmetric.

• The model obtained is more consistent with the original measurement model
  when we factorize using Spearman's grade correlation. This result does not
  depend on the number of dimensions or of asymmetric items.


To summarize, the factor results obtained with Spearman's grade correlation better
reproduce the measurement model present in the data, regardless of the number of
factors.


IPUMS Data for describing family and household
structures in the world


Fausta Ongaro, Silvana Salvini 


Abstract Our research focuses on changes in the characteristics of families and
households in developing countries (especially sub-Saharan countries) and developed
countries (especially European countries) over the last twenty years. The choice of
countries depends on the variables available in the different data sets.

The data come from the IPUMS database. IPUMS-International is dedicated to collecting
and distributing census data from around the world. The database currently describes
approximately 614 million persons recorded in 277 censuses taken from 1960 to the
present, and includes censuses from 82 countries. The data series includes
information on a broad range of population characteristics, at both household and
individual level, including fertility, mortality, occupational structure, education,
ethnicity, and household composition. The data on household composition are at the
core of our contribution. The information available in each sample varies according
to the questions asked in each census (https://international.ipums.org/international/).




Key words: Census data, Family structure, Household characteristics, Developed
and developing countries.


1 Introduction


Our research focuses on changes in the characteristics of families and households
in developing countries (especially sub-Saharan countries; see Bongaarts, 2001 and
Randall et al., 2015) and developed countries (especially European countries; see
Keilman, 2006) over the last twenty years. The choice of countries depends on the
variables available in the different data sets.

The data come from the IPUMS database. IPUMS-International is dedicated to collecting
and distributing census data from around the world. The database includes censuses
from 82 countries. The data series includes information on a broad range of
population characteristics, at both household and individual level, including
fertility, mortality, occupational structure, education, ethnicity, and household
composition. The data on household composition are at the core of our contribution.
The information available in each sample varies according to the questions asked in
each census (https://international.ipums.org/international/).


2 Data used 


Microdata will be used to analyze the regions of each country, using NUTS regions for
Europe and other territorial subdivisions for developing countries. Similarities and
dissimilarities between regions are detected by cluster analysis and other
multivariate techniques suited to large data sets.

The database currently describes approximately 614 million persons recorded in 277
censuses taken from 1960 to the present. The database we use includes censuses
from both European and sub-Saharan countries (Austria, Belarus, Burkina Faso,
Cameroon, Ethiopia, France, Germany, Ghana, Guinea, Greece, Hungary, Ireland,
Italy, Kenya, Liberia, Malawi, Mali, Mozambique, Netherlands, Nigeria, Portugal,
Romania, Rwanda, Senegal, Sierra Leone, Slovenia, South Africa, South Sudan,
Spain, Sudan, Switzerland, Tanzania, Uganda, United Kingdom, and Zambia). For
most of these countries, more than one census is available.

Most population data, especially census data, have traditionally been available
only in aggregated tabular form. IPUMS-International is composed of microdata,
which means that it provides information about individual persons and households.
Since this database includes most of the information originally recorded by each
census, users can construct a great variety of tabulations interrelating any desired
set of variables, and can fit models directly on these variables.


3 Methods 


After a preliminary analysis of the definitions of household (and of the meaning of
child and parent) used in the different censuses, individual variables (especially
sex, age and family relationship) and household variables (especially number of
members, number of unrelated persons and number of families) are used to build macro
variables able to describe the family and household structures in the countries of
the sub-Saharan region and of the European Union.

We will use the macro-data information to conduct cluster analyses on family and
household macro data at a sub-national geographic level for both Europe and
sub-Saharan Africa, in order to identify the regions that show similar
characteristics and trends at family level (WenYang Yu et al., 2015).
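The two steps described above, aggregating microdata into regional macro variables
and then grouping regions, can be sketched as follows. Everything here is a
simplifying assumption for illustration: the person-level records are synthetic rather
than IPUMS extracts, the only macro variable is the mean household size, and a
one-dimensional two-group k-means stands in for the cluster analysis.

```python
from collections import defaultdict

# Hypothetical household sizes by region (illustrative values, not IPUMS data).
household_sizes = {"A": [2, 3, 3], "B": [3, 4, 2], "C": [8, 9, 10], "D": [7, 12, 9]}

# Expand to person-level microdata: one (region, household_id) record per person.
persons = [(region, hh) for region, sizes in household_sizes.items()
           for hh, size in enumerate(sizes) for _ in range(size)]

def mean_household_size(records):
    """Aggregate person records into the mean household size of each region."""
    members = defaultdict(int)
    for region, hh in records:
        members[(region, hh)] += 1
    sizes = defaultdict(list)
    for (region, _), n in members.items():
        sizes[region].append(n)
    return {region: sum(v) / len(v) for region, v in sizes.items()}

def two_means(values, iterations=20):
    """One-dimensional k-means with two groups, started from the min and the max."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

means = mean_household_size(persons)
centers = two_means(list(means.values()))
labels = {region: (0 if abs(m - centers[0]) <= abs(m - centers[1]) else 1)
          for region, m in means.items()}
```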


4 Preliminary results 


Descriptive analysis gives the following distributions of individuals and the mean
number of persons per household, classified by country.

Large differences emerge between the two groups of countries (UN, 2004): European
regions generally show a low number of persons per household, whereas sub-Saharan
countries show a very high mean number of persons per household. In particular, we
note the high value in Senegal, but Guinea, Sierra Leone and Burkina Faso also show
a mean number of persons per household higher than 8. By contrast, in Europe
households are much smaller, and Germany in particular shows a value lower than 3.
Many countries, such as Italy and France, show slightly higher values.


Table 1 - Number of persons (Nb.), mean number of persons per household (Mean) and
standard deviation (St. Dev.) by country


Country          Mean        Nb.  St. Dev.

Europe
Austria          3.49    3929934     2.035
Belarus          3.26     990706     1.398
France           3.29   55880084     2.486
Germany          2.86   14623488     2.014
Greece           3.79    3749350     1.589
Hungary          3.54    2079868     1.792
Ireland          4.32    3377884     2.211
Italy            3.23    2990739     1.336
Portugal         3.72    2029940     1.797
Romania          3.89    6313566     1.809
Slovenia         3.47     179632     1.288
Spain            2.98   10162418     1.726
Switzerland      3.30    1337224     1.699
Ukraine          3.40    4889288     1.753

sub-Saharan Africa
Cameroon         7.92    3406084     5.726
Ethiopia         5.86   15882990     2.597
Ghana            7.26    5669774     5.057
Guinea           9.74    1186908     6.869
Kenya            5.61    8016659     4.186
Liberia          5.25     498313     4.271
Malawi           5.97    3132039     4.627
Mali             8.58    3228570     5.417
Mozambique       5.79    3598565     3.013
Nigeria          6.39     426395     3.356
Rwanda           5.99    1586310     2.629
Senegal         13.32    1694761     8.313
Sierra Leone     8.56     494298     5.452
South Africa     5.34   12813070     3.132
South Sudan      7.59     542765     4.279
Sudan            6.90    5066530     3.336
Uganda           6.62    4045909     3.496
Tanzania         6.67    6043159     3.887
Burkina Faso     8.95    3383667     5.200
Zambia           7.26    3105551     3.857


Topological Summaries for Time-Varying Data


Tullia Padellini and Pierpaolo Brutti 


Abstract Topology has proven to be a useful tool in the current quest for “insights 
on the data”, since it characterises objects through their connectivity structure, in an 
easy and interpretable way. More specifically, the new, but growing, field of TDA 
(Topological Data Analysis) deals with Persistent Homology, a multiscale version 
of Homology Groups summarized by the Persistence Diagram and its functional 
representations (Persistence Landscapes, Silhouettes etc). All of these objects, how- 
ever, are designed and work only for static point clouds. We define a new topological 
summary, the Landscape Surface, that takes into account the changes in the topology 
of a dynamical point cloud such as a (possibly very high dimensional) time series. 
We prove its continuity and its stability and, finally, we sketch a simple example. 


Key words: Persistent Homology, Time Series, Topological Inference


1 Introduction to TDA


As we deal with increasingly complex data, our need to characterise them through a
few interpretable features has grown considerably. In recent years there has been
considerable interest in the study of the “shape of data” [2]. Among the many ways a
“shape” could be defined, topology is the most general one, as it describes an object
in terms of its connectivity structure: connected components (topological features
of dimension 0), cycles (features of dimension 1) and so on. There is a growing
number of techniques (generally denoted as Topological Data Analysis) aimed at
estimating the shape of a point cloud through some topological invariant. In this
work we extend those techniques to the case of multivariate time series, i.e. when,
rather than considering only one point cloud, we are dealing with a collection of
point clouds indexed by time, as for example in animal migration, player tracking
in sports, EEG signals and most spatio-temporal data. Our goal is to summarize in
one object not only the shape of the data at each fixed time, but also how this
shape changes with time.


Before introducing new objects, it is worth briefly reviewing what Topological Data
Analysis (TDA) is, and how we can estimate the topology of data or, to be more
precise, the topology of the space $\mathcal{M}$ the data were sampled from. As a
matter of fact, data in the form of a point cloud $X = \{X_1,\dots,X_n\}$ have a
trivial topological structure, consisting of as many connected components as there
are observations and no higher dimensional features. The basic idea in TDA is thus
to use data to build “shape aware” estimates of $\mathcal{M}$ and then compute
topological invariants. One of the most common ways of estimating $\mathcal{M}$ in
TDA is the Devroye-Wise support estimator $\hat{\mathcal{M}}_\varepsilon$, built by
centering a ball of fixed radius $\varepsilon$ at each of the observations $X_i$, i.e.

$$\hat{\mathcal{M}}_\varepsilon = \bigcup_{i=1}^{n} B(X_i, \varepsilon)$$

where $B(Y, \delta)$ denotes a ball of radius $\delta$ and center $Y$. For each value
of $\varepsilon$ we obtain a different estimate $\hat{\mathcal{M}}_\varepsilon$, whose
topology can be recovered by computing its Homology Groups. Persistent Homology, a
multiscale version of Homology, then allows us to analyze how those Homology Groups
change with $\varepsilon$.
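For features of dimension 0, the dependence of the topology on the radius can be
sketched directly: two balls of radius eps overlap when their centers are at distance
at most 2 * eps, so the connected components of the union of balls can be counted with
a union-find structure. The point cloud below is hypothetical and the function name is
an assumption of this sketch.

```python
import math

def count_components(points, eps):
    """Number of connected components of the union of balls of radius eps."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Balls centered at points i and j intersect: merge their components.
            if math.dist(points[i], points[j]) <= 2.0 * eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Hypothetical point cloud made of two well-separated groups of three points.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
components_small = count_components(cloud, 0.01)  # every point is isolated
components_mid = count_components(cloud, 0.1)     # the two groups appear
components_large = count_components(cloud, 1.0)   # everything is connected
```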


Persistent Homology Groups can be summarized by the Persistence Diagram, a
multiset $D = \{(b_i,d_i), i = 1,\dots,m\}$ whose generic element $(b_i, d_i)$ is the
generator of the i-th Persistent Homology group. The space of Persistence Diagrams
$\mathcal{D}$ is a metric space when endowed with the Bottleneck distance, which,
given two multisets $A$ and $B$, is defined as


$$d_B(A,B) = \inf_{\gamma} \sup_{x \in A} \| x - \gamma(x) \|_\infty$$


where the infimum is taken over all bijections $\gamma : A \to B$.

Fig. 1 $\hat{\mathcal{M}}_\varepsilon$ for different values of $\varepsilon$. For
small values of $\varepsilon$ (left), the topology of $\hat{\mathcal{M}}_\varepsilon$
is close to that of the point cloud itself. As $\varepsilon$ grows, more and more
points start to be connected, until eventually (right) the corresponding
$\hat{\mathcal{M}}_\varepsilon$ is homeomorphic to a point. The values
$\varepsilon_b$, $\varepsilon_d$ of $\varepsilon$ at which two components are
connected for the first time (birth-step) and at which they are connected to some
other larger component (death-step) are the generators of a Persistent Homology Group.


The Bottleneck distance allows us to compare Persistence Diagrams and to define 
their most important property: stability [4]. 


Theorem 1. Let $X$, $Y$ be two point clouds, and $D_X$, $D_Y$ their corresponding
Persistence Diagrams; then

$$d_B(D_X, D_Y) \le 2\, d_H(X,Y)$$


where $d_H(A,B)$ is the Hausdorff distance between two topological spaces $A$ and $B$.


Roughly speaking, this means that if two point clouds are similar, then their Per- 
sistence Diagrams will be as well, and is therefore instrumental for using them in 
statistical tasks such as classification or clustering. 


Since Persistence Diagrams are general metric objects, it is usually advisable to
transform them in order to work in more statistics-friendly spaces. The best-known
transformations of the persistence diagram are the persistence landscape [1] and the
persistence silhouette [3], which are functions built by mapping each point
$z = (b_i,d_i)$ of a Persistence Diagram $D$ to a piecewise linear function called
the “triangle” function $T_z$, defined as


$$T_z(y) = (y - b_i + d_i)\,\mathbb{1}_{[b_i - d_i,\, b_i]}(y) + (b_i + d_i - y)\,\mathbb{1}_{(b_i,\, b_i + d_i]}(y)$$


where $\mathbb{1}_A(x) = 1$ if $x \in A$ and $\mathbb{1}_A(x) = 0$ otherwise.
Informally, a triangle function links each point of the diagram to the diagonal with
segments parallel to the axes, which are then rotated by 45 degrees.


The blocks $T_z$ can be combined in many different ways. If we take their kmax, i.e.
the k-th largest value in the set $\{T_z(y)\}_{z \in D}$, we obtain the Persistence
Landscape


$$\lambda_D(k,y) = \operatorname*{kmax}_{z \in D} T_z(y), \quad k \in \mathbb{Z}^+.$$


The persistence landscape is the collection of functions $\lambda_D(k,y)$. If we
take a weighted average of the functions $T_z(y)$, we obtain the Power Weighted
Silhouette.


Fig. 2 Persistence Diagram (left), Persistence Landscape (center) and Persistence
Silhouette for different values of p (right) of the data shown in Fig. 1


Although we lose some information in going from Persistence Diagrams to Persistence
Landscapes, the main result we had for Diagrams, i.e. stability, still holds [1].


Theorem 2. Let $X$, $Y$ be two point clouds, $D_X$, $D_Y$ their corresponding
Persistence Diagrams, and $\lambda_X$, $\lambda_Y$ their corresponding Persistence
Landscapes; then


$$d_\Lambda(\lambda_X, \lambda_Y) \le d_B(D_X, D_Y) \le 2\, d_H(X, Y)$$


where $d_\Lambda(\lambda_X, \lambda_Y) = \| \lambda_X - \lambda_Y \|_\infty$ is the
$L^\infty$ distance in the space of Persistence Landscapes.
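The triangle function and the kmax operation used to build the Persistence Landscape
can be sketched as follows. The sketch assumes that each diagram point is given in the
coordinates used by the triangle formula above (first coordinate the location of the
peak, second coordinate its height); the diagram values are hypothetical.

```python
def triangle(point, y):
    """Triangle function T_z(y) for a diagram point z = (b, d)."""
    b, d = point
    if b - d <= y <= b:
        return y - b + d
    if b < y <= b + d:
        return b + d - y
    return 0.0

def landscape(diagram, k, y):
    """k-th largest value of the triangle functions at y (0 if fewer than k points)."""
    values = sorted((triangle(z, y) for z in diagram), reverse=True)
    return values[k - 1] if k <= len(values) else 0.0

# Hypothetical diagram with two points.
diagram = [(1.0, 1.0), (2.0, 0.5)]
first_layer = [landscape(diagram, 1, y) for y in (1.0, 1.5, 2.0)]
second_layer = [landscape(diagram, 2, y) for y in (1.0, 1.5, 2.0)]
```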


2 The Landscape Surface 


In order to study the evolution of the topological structure of time-varying data, we
think of a multidimensional time series $X(t)$ as a dynamic point cloud; for every
fixed time $t$ we can use the tools previously defined and build a Persistence
Diagram $D(t)$, a Landscape $\lambda_{X(t)}(k,y)$ and a Silhouette. Intuitively we
can consider this Persistence Landscape $\lambda_{X(t)}$ as a function of time $t$
as well, which means that we can work with a surface rather than just a curve.
Although in the following we focus on Landscapes, the same results hold for
Silhouettes as well.


Definition 1. Given a dynamic point cloud $X(t)$ we define the Landscape Surface as
the function

$$\lambda(t,k,y) = \lambda_{X(t)}(k,y) \quad \forall\, t,k,y.$$


This surface is still a meaningful topological summary, as we can prove its stability.

Theorem 3. Let $X(t)$, $Y(t)$ with $t \in (0,1)$ be two continuous dynamic point
clouds, and $\lambda_X$ and $\lambda_Y$ their corresponding Landscape Surfaces; then:


1. $\lambda_X$ and $\lambda_Y$ are continuous;
2. $I_\Lambda(\lambda_X, \lambda_Y) \le 2\, I_H(X,Y)$


where $I_\Lambda(\lambda_X, \lambda_Y) = \int_0^1 d_\Lambda(\lambda_{X(t)}, \lambda_{Y(t)})\,dt$
is the Integrated $L^\infty$ distance on the space of Persistence Landscapes and
$I_H(X,Y) = \int_0^1 d_H(X(t), Y(t))\,dt$ is the Integrated Hausdorff distance for
dynamic point clouds.


The proof is a direct consequence of the Stability Theorem for Persistence
Landscapes (Theorem 2). In fact:


1. For a fixed $t$, consider $X(t)$ and $X(t + \varepsilon)$ (the same applies to
   $Y$). By Theorem 2 and the continuity of $X(t)$ we have

   $$0 \le \lim_{\varepsilon \to 0} d_\Lambda(\lambda_{X(t)}, \lambda_{X(t+\varepsilon)}) \le \lim_{\varepsilon \to 0} 2\, d_H(X(t), X(t+\varepsilon)) = 0.$$

2. For a fixed $t$, by Theorem 2 we have

   $$d_\Lambda(\lambda_{X(t)}, \lambda_{Y(t)}) \le 2\, d_H(X(t), Y(t));$$

   integrating both sides over $t$ is enough to prove the result.
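A discretised version of the Landscape Surface and of the integrated distance can be
sketched as below. The sketch assumes that one Persistence Diagram per time step is
already available (the diagrams here are hypothetical), fixes a single layer k,
replaces the supremum over y by a maximum over a grid, and replaces the integral over
t by a Riemann sum with step dt.

```python
def triangle(point, y):
    """Triangle function T_z(y) for a diagram point z = (b, d)."""
    b, d = point
    if b - d <= y <= b:
        return y - b + d
    if b < y <= b + d:
        return b + d - y
    return 0.0

def landscape(diagram, k, y):
    """k-th largest triangle value at y (0 if the diagram has fewer than k points)."""
    values = sorted((triangle(z, y) for z in diagram), reverse=True)
    return values[k - 1] if k <= len(values) else 0.0

def landscape_surface(diagrams_over_time, k, grid):
    """Rows indexed by time step, columns by grid value y, for a fixed layer k."""
    return [[landscape(d, k, y) for y in grid] for d in diagrams_over_time]

def integrated_distance(surface_a, surface_b, dt):
    """Riemann-sum approximation of the integrated sup-distance between surfaces."""
    return dt * sum(max(abs(a - b) for a, b in zip(row_a, row_b))
                    for row_a, row_b in zip(surface_a, surface_b))

# Hypothetical diagrams at three time steps: one feature that shrinks in X only.
diagrams_x = [[(1.0, 1.0)], [(1.0, 0.5)], [(1.0, 0.25)]]
diagrams_y = [[(1.0, 1.0)], [(1.0, 1.0)], [(1.0, 1.0)]]
grid = [0.0, 0.5, 1.0, 1.5, 2.0]

surface_x = landscape_surface(diagrams_x, 1, grid)
surface_y = landscape_surface(diagrams_y, 1, grid)
distance_xy = integrated_distance(surface_x, surface_y, dt=0.5)
```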


In order to show an example of this object with real data, we consider EEG data,
which are signals recorded at a very high frequency through many different
electrodes (64 in our case). We build the Landscape Surface using EEG signals from
an alcoholic and a control patient, both under the same stimuli. As we can clearly
see from Figs. 3 and 4, these two subjects show very different behavior. While the
signal from the control patient is strongly characterized by a few persistent
features, the signal from the alcoholic patient is less structured: there are many
features, but they all have a smaller persistence and could therefore be interpreted
as noise.


Fig. 3 Persistence Diagram of the Alcoholic and Control subjects for a fixed time t.


Fig. 4 Landscape Surface of dimension 1 for the EEG signal of a control patient (top) and an
alcoholic (bottom)


References 


1. Bubenik, P.: Statistical topological data analysis using persistence landscapes. The
   Journal of Machine Learning Research, 16(1), 77-102 (2015)
2. Carlsson, G.: Topology and data. Bulletin of the American Mathematical Society, 46(2),
   255-308 (2009)
3. Chazal, F., Fasy, B. T., Lecci, F., Rinaldo, A., Wasserman, L.: Stochastic convergence
   of persistence landscapes and silhouettes. In Proceedings of the thirtieth annual
   symposium on Computational geometry, ACM (2014)
4. Cohen-Steiner, D., Edelsbrunner, H., Harer, J.: Stability of persistence diagrams.
   Discrete and Computational Geometry, 37(1), 103-120 (2007)
5. Edelsbrunner, H., Letscher, D., Zomorodian, A.: Topological persistence and
   simplification. Discrete and Computational Geometry, 28(4), 511-533 (2002)
6. Munch, E.: Applications of persistent homology to time varying systems. Diss. Duke
   University (2013)


Modeling of Complex Network Data for 
Targeted Marketing 




Sally Paganin 


Abstract Developing strategies for targeted advertising of existing customers is a 
common goal in many business sectors, with usual practice focused on identify- 
ing shared acquisition patterns of products based on ownership data. We observe 
customers’ behavior for multiple agencies within the same company, monitoring 
choices of specific products along with co-subscription networks representing mul- 
tiple purchases. Our aim is to exploit co-subscription networks to efficiently in- 
form targeted advertising of cross-sell strategies to currently mono-product cus- 
tomers. We address this goal by developing a Bayesian joint model for mixed do- 
main data which adaptively clusters agencies characterized by a similar customer 
base, exploiting a cluster-dependent mixture of latent eigenmodels to describe multi- 
purchase networks. An application to data from the insurance market is presented. 


Key words: Business Intelligence; Co-clustering; Cross-sell Marketing Strategies;
Mixed domain data


1 Introduction


Business statistics is increasingly focused on developing new tools for the
definition of cross-selling campaigns, targeting existing customers instead of
attracting new ones. Adding value to the current customer base has proved to be an
efficient strategy for the growth of a company, enhancing customer retention by
increasing switching costs. For this reason, mono-product customers, who purchase a
single product from a company, represent a key segment of the customer base, and
companies are increasingly interested in extending these customers' purchases to
additional products.

Current methods focus on identifying shared acquisition patterns of products, based
on customer ownership data, sometimes along with additional data such as demographic
records or survey responses, aiming to provide some measure of product subscription
propensity [3]. Even if such methods can lead to useful insights into customers'
purchasing behavior, they are usually limited to the analysis of a single portfolio,
while many companies have agencies spread all over a territory. In such settings,
efficient targeting of customers and differentiation of the advertising may lead to
better profits despite the higher costs. We propose a Bayesian joint model for mixed
domain data which clusters agencies characterized by a similar composition of their
mono-product customer choices as well as a comparable multi-purchase behavior. We
build on the model presented in [2], providing a more general setting. In the next
two sections we present the modeling framework, while in Section 4 an application to
real data is discussed.


2 Definition of cross-sell strategies 


Customers can be divided into mono-product and multi-product customers, having
purchased one or multiple products among a total of $V$ products. Let
$y_{is} \in \{1,\dots,V\}$ denote the product subscribed by mono-product customer
$s$, $s = 1,\dots,n_i$, within agency $i$, for $i = 1,\dots,n$. We represent
multi-purchase behavior in each agency $i$ as a co-subscription network, described
via a $V \times V$ adjacency matrix $A_i$, with $A_{i[vu]} = A_{i[uv]} = 1$ meaning
that product $u$ and product $v$ are subscribed together, and
$A_{i[vu]} = A_{i[uv]} = 0$ otherwise.

The definition of the edges between pairs of products may depend on the company's
requirements or, in the absence of those, on some threshold criterion to be fixed.
In our application we define an edge between two products $v$ and $u$ if the number
of customers subscribing to both products exceeds 10% of the total number of
multi-product customers subscribed to at least one of the two. Hence the presence
of an edge between two products suggests a preference of customers in agency $i$
for that specific pair, controlling for the total number of multi-product customers
subscribed to at least one of the two products.
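The thresholding rule above can be sketched as follows. The input format (one set of
zero-indexed product codes per multi-product customer), the function name and the
example data are assumptions made for illustration.

```python
from itertools import combinations

def co_subscription_matrix(baskets, n_products, threshold=0.10):
    """Adjacency matrix with an edge (u, v) when the customers holding both products
    exceed `threshold` times the customers holding at least one of the two."""
    adjacency = [[0] * n_products for _ in range(n_products)]
    for u, v in combinations(range(n_products), 2):
        both = sum(1 for basket in baskets if u in basket and v in basket)
        either = sum(1 for basket in baskets if u in basket or v in basket)
        if either > 0 and both > threshold * either:
            adjacency[u][v] = adjacency[v][u] = 1
    return adjacency

# Hypothetical multi-product customers of one agency, with V = 4 products.
baskets = [{0, 1}] * 6 + [{2, 3}] * 4 + [{0, 2}]
adjacency = co_subscription_matrix(baskets, n_products=4)
```

In this example products 0 and 2 are held together by one customer out of the eleven
who hold at least one of them, which does not exceed the 10% threshold, so no edge is
drawn between them.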

We may exploit the co-subscription network of each agency to estimate the propensity
of a customer who subscribed to product $v$ to additionally buy $u \neq v$, and pair each
product $v = 1,\dots,V$ with the one, say $u_{iv}$, that maximizes this propensity as the best
choice to offer to a customer who already holds product $v$. However, to define
an effective cross-selling strategy it is important to also take into consideration the
proportion of mono-product customers that the strategy would target. We
denote this proportion by $p_{iv} = \mathrm{pr}(\mathcal{Y}_i = v)$, with $\mathcal{Y}_i$ the random variable
describing the choices of the mono-product customers in agency $i$. Depending on $p_{iv}$,
strategy $u_{iv}$ may target only a small proportion of existing customers, and so turn out
to be less efficient than a strategy with a lower estimated subscription propensity
that targets a wider portion of the customer base. To take into account the role
of $p_{iv}$, we associate each strategy $u_{iv}$ with a performance indicator $e_{iv}$, the product
of $p_{iv}$ and the estimated propensity for the pair $(v, u_{iv})$, with $u_{iv} \neq v$, for
$i = 1,\dots,n$. Strategies with a high $e_{iv}$ will target a sizable proportion of the
available mono-product customers in agency $i$ with advertising for a new product
likely to be appealing to them.
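
As a minimal sketch of how such strategies could be ranked, the snippet below assumes that the performance indicator is the product of the mono-product proportion and the estimated propensity of the selected pair. The function `best_strategies`, the propensity matrix and all numbers are hypothetical; in the paper the propensities are estimated from the co-subscription networks.

```python
def best_strategies(p_mono, propensity):
    """For each product v, pick the partner u != v with the highest
    propensity and return (u, e), with e = p_mono[v] * propensity[v][u]."""
    V = len(p_mono)
    out = {}
    for v in range(V):
        u_best = max((u for u in range(V) if u != v),
                     key=lambda u: propensity[v][u])
        out[v] = (u_best, p_mono[v] * propensity[v][u_best])
    return out

# Hypothetical agency: proportions of mono-product customers per product
# and a symmetric matrix of estimated co-subscription propensities.
p_mono = [0.6, 0.3, 0.1]
propensity = [[0.0, 0.2, 0.1],
              [0.2, 0.0, 0.5],
              [0.1, 0.5, 0.0]]
strategies = best_strategies(p_mono, propensity)
```

Here the strategy attached to product 2 has the highest propensity but reaches only 10% of mono-product customers, so its indicator is lower than that of the strategy attached to product 0.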

Since we observe data for agencies belonging to the same company, it is reasonable to
expect them to share some patterns in mono- and multi-purchasing behavior. For
groups of agencies having sufficiently similar customer bases, an identical strategy
can be adopted to reduce administrative costs without decreasing effectiveness. On
the basis of these assumptions, we introduce an underlying clustering mechanism in
the modeling framework; efficient detection of clusters allows adaptive reduction of
the total number of strategies from $n$ to $K < n$.

We address this goal by proposing a Bayesian hierarchical joint model for the
data $\{(y_{i1},\dots,y_{in_i}), A_i\}$, for $i = 1,\dots,n$, which characterizes the distribution across
agencies of the mono-product customer subscriptions along with the co-subscription
network for multi-product customers. The model is chosen to be flexible while
automatically clustering agencies that have similar mono-product customer choices and
co-subscription network profiles.


3 Model 


Let $G = (G_1,\dots,G_n)$ be the vector of cluster assignments, with $G_i \in \{1,\dots,K\}$
indicating the cluster membership of agency $i$. Conditionally on the cluster, we provide a
cluster-specific probabilistic representation of the mono-product customer choices,
as well as a cluster-specific probabilistic generative mechanism underlying the
co-subscription networks.

Let $p_k = (p_{k1},\dots,p_{kV})$ be a cluster-specific probability vector, with $p_{kv}$
indicating the probability of subscription to product $v$ for a mono-product customer
in an agency belonging to cluster $k$. Assuming independence of the mono-product
customer choices, the joint probability of the mono-product data of agency $i$ given
the cluster assignment $G_i = k$ is

$$\prod_{s=1}^{n_i} \mathrm{pr}(\mathcal{Y}_i = y_{is} \mid G_i = k) = \prod_{v=1}^{V} p_{kv}^{n_{iv}}, \qquad (1)$$

with $n_{iv}$ the number of customers in agency $i$ that subscribed to product $v$.

In defining a conditional model for the co-subscription networks within each
cluster we leverage the probabilistic framework from [1], in which a flexible Bayesian
model for a population of networks is provided via a mixture of latent eigenmodels.
We refer to the original paper for details about the model and the related prior
specification. We complete the latter by considering a conjugate Dirichlet prior distribution
for the mono-product choice probabilities and a nonparametric prior for the cluster
assignment vector $G$.

In particular we consider the Chinese Restaurant representation of the Pitman-Yor
process (CRP-PY) [5] with discount parameter $d \in [0, 1)$ and concentration parameter
$\alpha > -d$. Under this representation the conditional prior probability of allocating
observation $i$ to an already existing cluster $k$ is $(n_{k,-i} - d)/(\alpha + n - 1)$, where
$n_{k,-i}$ is the number of observations in cluster $k$ excluding the $i$th one. The conditional
prior probability of creating a new cluster is instead $(\alpha + dK^{-})/(\alpha + n - 1)$, with
$K^{-}$ the total number of nonempty clusters after removing the $i$th observation. Setting
$d = 0$, the Pitman-Yor process reduces to the Dirichlet process with parameter $\alpha$,
whereas as the discount parameter increases, observations are less likely to be
allocated to large clusters and more likely to be allocated to a new one. The advantage
of this more general specification is that the parameter $d$ can be adapted to the
company requirements, for example penalizing the creation of new clusters in order to
reduce the administrative costs by advertising a smaller number of cross-selling
campaigns or, on the contrary, favoring it with the aim of providing more specific advertising.
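
The allocation rule can be made concrete with a short sketch. It assumes the standard Chinese Restaurant form of the Pitman-Yor process, in which the probabilities are normalized by the concentration parameter plus the number of remaining observations; the function name and the toy cluster sizes are hypothetical.

```python
def crp_py_probabilities(cluster_sizes, d, alpha):
    """Prior allocation probabilities for one held-out observation.

    cluster_sizes: sizes of the nonempty clusters after removing the
                   observation being allocated.
    Returns (probabilities of joining each existing cluster,
             probability of opening a new cluster).
    """
    if not 0 <= d < 1 or alpha <= -d:
        raise ValueError("need 0 <= d < 1 and alpha > -d")
    n_minus = sum(cluster_sizes)   # number of remaining observations
    k_minus = len(cluster_sizes)   # number of nonempty clusters
    denom = alpha + n_minus
    existing = [(n_k - d) / denom for n_k in cluster_sizes]
    new = (alpha + d * k_minus) / denom
    return existing, new

# Same partition, Dirichlet process (d = 0) versus Pitman-Yor (d = 0.5).
dp_existing, dp_new = crp_py_probabilities([5, 3, 1], d=0.0, alpha=1.0)
py_existing, py_new = crp_py_probabilities([5, 3, 1], d=0.5, alpha=1.0)
```

With the same partition, moving from `d = 0` to `d = 0.5` raises the probability of a new cluster from 0.10 to 0.25 and lowers the probability of joining each existing cluster.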

Posterior computation is available via a simple Gibbs sampler which exploits 
results in [4] to allocate agencies to clusters under the CRP-PY prior and steps in 
[1] to update the quantities describing the co-subscription networks. 


4 Application 


We analyzed subscription data provided by $n = 130$ agencies selling $V = 15$ products,
belonging to a company operating in the insurance market. Initialization of the
network-related quantities follows the directions in [1], while we center the
hyperparameters of the mono-product probabilities around the average preferences of
mono-product customers in the entire company. We evaluate the clustering behavior under
the CRP-PY prior by choosing different values for the hyperparameters $d$ and $\alpha$. In
particular we consider $d \in \{0, 0.25, 0.5, 0.75\}$ and pick the corresponding
values of $\alpha$ such that the a priori expected number of clusters is either 5 or 15.
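
One way to pick the concentration parameter for a given discount is sketched below. It relies on the fact that, under the CRP-PY allocation rule, the probability of opening a new cluster is linear in the current number of clusters, so the prior expected number of clusters can be accumulated one observation at a time; the concentration parameter is then found by bisection. The function names, the search bounds and the tolerance are assumptions made for illustration and are not taken from the paper.

```python
def expected_clusters(alpha, d, n):
    """Prior expected number of clusters among n observations."""
    ek = 1.0                          # the first observation opens a cluster
    for m in range(1, n):
        # P(new cluster | K clusters among m observations)
        #   = (alpha + d * K) / (alpha + m), linear in K.
        ek += (alpha + d * ek) / (alpha + m)
    return ek

def alpha_for_target(target, d, n, hi=1e3, tol=1e-10):
    """Bisection on alpha > -d so that expected_clusters equals target."""
    lo = -d + 1e-9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_clusters(mid, d, n) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Concentration values matching 5 expected clusters for n = 130 agencies.
alphas = {d: alpha_for_target(5, d, 130) for d in (0.0, 0.25, 0.5, 0.75)}
```

For large discounts the matching concentration value can be negative, which is admissible as long as it stays above `-d`.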

In our application the clustering behavior appears to be quite robust, producing a
number of clusters between 20 and 25 regardless of the prior expected number, and
changing with the value of the discount parameter $d$. The results are characterized
by the presence of 4 large groups comprising 60% of the total number of
agencies, with the parameter $d$ mostly affecting small clusters of 2 or 3 observations.

When computing estimates of cluster-specific cross-selling strategies and associated
performance indicators, different clusters turn out to share similar cross-sell strategies,
with minor differences in mono- and multi-product customer profiles highlighted
across different clusters of agencies. This is a reasonable finding, since the data come
from agencies in the same company, and it provides insights into which strategies could
be advertised for all the agencies and which specific ones are potentially more
profitable.
Statistical categorization through
archetypal analysis 


Analisi archetipale per la definizione di categorie 


Francesco Palumbo and Giancarlo Ragozini 


Abstract Human knowledge develops through complex relationships between
categories. In the era of Big Data, categorization implies summarizing data in a
limited number of groups that must be well separated and, at the same time, maximally
homogeneous internally. This proposal exploits the ability of archetypal analysis to
find a set of extreme points that can summarize the entire data set in homogeneous
groups. Archetypes are then used to identify the best prototypes according to
Rosch’s definition. Finally, following the geometric approach to cognitive science, the
Voronoi tessellation based on the prototypes is used to define a categorization. An
example on the well-known wine data set of Forina et al. illustrates the procedure.


Abstract The ability to define complex relationships between categories is the basis of
human knowledge. In the era of Big Data, building categories depends on the
ability to summarize the data in a limited number of groups that are well separated
from one another and internally homogeneous. This proposal uses archetypal analysis
to find a set of extreme points able to summarize the entire data set
in homogeneous groups. The archetypes are then used to identify the prototypes,
in the sense proposed by Rosch. Finally, a Voronoi tessellation is used to
define the categorization with respect to each prototype. An example
on the wine data set made available by Forina et al. illustrates the procedure.


Key words: Categorization, Archetypal Analysis, Prototypes 


1 Introduction 


Knowledge consists basically of categorizations: humans learn new concepts very
fast by building complex relationships between a set of complex items or categories.
Whilst the total number of objects that can be considered should remain limited to
five or six, these objects can be described by several features defining a high degree of
complexity. Categories are stored in our long-term memory, and it has been
demonstrated that we recall the categories in the working memory, developing connections
among them that improve our knowledge [4]. In other words, a few examples of a
new concept are often sufficient for us to grasp its meaning. On the contrary, we are
overwhelmed by a large amount of data and information. With the explosion of Big
Data problems, statistical learning has become a very active field in many scientific
areas, as well as in marketing, finance, and other environmental and behavioral
disciplines. The huge amount of stored data represents an incredible source of
knowledge, provided that it can be summarized in a (small) number of categories that
are consistent with human cognitive capabilities.

In the present paper we parallel the cognitive process of categorization through
statistical learning techniques relying on the conceptual spaces framework [14], in
which conceptual spaces are geometric structures and categorization mainly
consists in a partitioning of the conceptual spaces. The paper is structured in
four sections besides this introduction: Section 2 discusses the relationship between
statistical learning and the construction of a categorization in cognitive science.
Section 3 presents the identification of prototypes after the archetypal analysis; through
an example based on real data, Section 4 presents the Voronoi tessellation [26] starting
from the prototypes as a tool to derive a categorization in the conceptual space; the last
section presents some concluding remarks and possible future research directions.


2 Statistical learning and cognitive categorization 


Statistical and machine learning can significantly speed up the development of human
knowledge, helping to find the basic categories in a relatively short time.
Exploratory Data Analysis (EDA) can be considered the forefather of statistical
learning: it relies on the mind’s ability to learn from the data and, in particular, it aims
to summarize datasets through a limited number of interpretable latent features or
clusters, offering cognitive geometric models to define categorizations. It can also be
understood as the implementation of the human cognitive process extended to large
or huge amounts of data: the “Big Data” [16]. Factorial models belong to the former
approach (latent features): they represent the original data in a reduced space by
replacing the original variables with a reduced number of linear mixtures of
independent components. These methods include principal component analysis (PCA),
independent component analysis (ICA) and, when dealing with multiple datasets,
independent vector analysis. On the other hand, fuzzy and crisp clustering methods
allow us to represent each statistical unit as a weighted sum of the means of the
groups that minimize the overall model error.

However, EDA itself cannot answer the questions: “How many, and what are
the categories to retain?” and “What are the observations that better than others
can be understood and elaborated in the human cognitive processes?”. In cognitive
science, according to Rosch [24, 25], the best is related to the concept of typicality:
in other words, we must look for those elements that can represent a category better
than others. We call these elements prototypes and measure their degree of
representativeness using a distance function to a salient entity of the category [11, 22]. These
objects can be observed or unobserved (abstract), and they can be represented by
a single value or by interval-valued variables. In many cases, in classification and
clustering, and more generally in cognitive sciences, the concept of prototype has
been unknowingly adopted to synthesize and represent categories [3, 2]. However,
when dealing with Big Data, the role of prototypes becomes more and more relevant,
giving rise to a wide variety of studies in the literature on prototype-based clustering
methods (see [17, Chap. 13]).

Identifying groups that can be connected to a related prototype does not complete
the categorization process. Without a proper description, prototypes cannot be
advantageous to learning. D’Esposito et al. (2012, 2013) [6, 7] and Ragozini et al.
(2016) [22] considered archetypal analysis, as proposed by Cutler and Breiman
[5], to identify the prototypes in a geometric view. Following the idea of
symbolic object [9], D’Esposito et al. (2013) [7] proposed describing the prototypes
in terms of symbolic objects. The present proposal builds on the conceptual space
framework and, starting from the geometric properties of the proposed prototypes,
exploits the Voronoi tessellation to obtain a data-driven categorization, i.e. a partition
of the conceptual space into convex regions centered on the prototypes.


3 Prototype identification 


In the statistical literature, several numerical techniques have been proposed to find
prototypes in a given multivariate dataset, based on a variety of criteria. The
most widely used techniques are generally based on non-hierarchical clustering
algorithms [8, 18]. In this proposal, however, we present some recent results on the
definition of prototypes through archetypal analysis. Archetypal analysis (AA)
was first introduced by Cutler and Breiman [5]; it is essentially a matrix factorization
method for a generic $n \times p$ data matrix $X$ that solves $\min_{\Gamma, B} \|X - \Gamma B X\|_F$,
where $\Gamma$ and $B$ are the factorization matrices of order $n \times k$ and $k \times n$,
respectively, and $\|\cdot\|_F$ stands for the Frobenius norm. The matrices $B$ and $\Gamma$ have nonnegative
entries and must satisfy the following constraints: i) $B 1_n = 1_k$; ii) $\Gamma 1_k = 1_n$, where
$1_m$ is a vector of $m$ ones. The $k \times p$ matrix $A = BX$ contains the $k$ archetypes, where $k$
is assumed to be defined a priori. It is worth noting that the matrix $\Gamma$ defines a fuzzy
allocation rule of each data point to the $k$ archetypes: let $\gamma_{ij}$ denote the
general term of $\Gamma$, with $i = 1,\dots,n$ and $j = 1,\dots,k$. As $\sum_j \gamma_{ij} = 1$, $\gamma_{ij}$ represents the
membership degree of $x_i$ to the archetype $a_j$.
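
To make the factorization concrete, the following sketch evaluates the archetypes and the Frobenius loss for a given pair of constrained matrices. It does not optimize anything (it is not the alternating algorithm of Cutler and Breiman); the helper names and the toy data are hypothetical.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def aa_loss(X, Gamma, B, tol=1e-9):
    """Return the archetypes A = B X and the loss ||X - Gamma B X||_F,
    after checking that Gamma and B are nonnegative with unit row sums."""
    for M in (Gamma, B):
        for row in M:
            if min(row) < 0 or abs(sum(row) - 1.0) > tol:
                raise ValueError("rows must be nonnegative and sum to one")
    A = matmul(B, X)            # k x p matrix of archetypes
    X_hat = matmul(Gamma, A)    # n x p convex reconstruction of the data
    squared = sum((x - xh) ** 2
                  for rx, rh in zip(X, X_hat) for x, xh in zip(rx, rh))
    return A, math.sqrt(squared)

# Four points in the plane; the first three are used as archetypes and the
# fourth is the midpoint of the second and third.
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
Gamma = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0.5, 0.5]]
A, loss = aa_loss(X, Gamma, B)
```

In this toy case every point is an exact convex combination of the archetypes, so the loss is zero, and the last row of `Gamma` reads as equal membership of the fourth point in the second and third archetypes.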

Setting up structural constraints makes learning more efficient. In other words,
one can constrain the learning process to a convex space. However, adding structural
constraints often means that some form of information about the relevant domains
or other dimension-generating structures is added. Consequently, this strategy
presumes a conceptual level in the construction of the prototypes. Archetypal analysis
exploits redundancies in the input data: it finds a user-determined number of
archetypes that can be used to represent (approximate) the data points.
It is worth noting that the archetypal analysis constraints ensure a symmetrical
relationship between archetypes and data points: archetypes are convex combinations of
data points, and data points are approximated by convex combinations of
archetypes.

In this view, we propose a geometric approach that identifies prototypes
as the most typical objects within a group or a category. A prototype is the
member of a group that best represents the other members (i.e., internal
resemblance) and that at the same time differs from the members of the other groups
or categories (i.e., external dissimilarity). This double semantics related to
centrality and extremeness can be operationalized through a typicality index $T(\cdot,\cdot)$
[23, 13, 19, 20].

Formally, given a set of $n$ objects $\Omega = \{x_i\}_{i=1,\dots,n}$, $x_i \in \mathbb{R}^p$, and a partition
$\mathcal{C} = (C_1,\dots,C_K)$ of $\Omega$ into $K$ groups, an internal resemblance measure $R(x_i, C_h)$ of $x_i$ w.r.t.
$x_{i'} \in C_h$, an external dissimilarity measure $D(x_i, C_h)$ of $x_i$ w.r.t. $x_{i'} \notin C_h$, and a mixing
function $\Phi(\cdot,\cdot)$ that combines both measures, a typicality index $T(x_i, C_h)$ of $x_i$ with
respect to the class $C_h$ is given by:

$$T(x_i, C_h) = \Phi(R(x_i, C_h), D(x_i, C_h)). \qquad (1)$$

The set of prototypes $\mathcal{P} = (p_1,\dots,p_K)$ is then defined as:

$$\mathcal{P} = \{p_h \in \mathbb{R}^p \mid p_h = \arg\max_{x_i} T(x_i, C_h),\ h = 1,\dots,K\}. \qquad (2)$$


It is clear that, in this framework and setting, the prototype identification
depends on the ways in which dissimilarity and resemblance are measured, and on the
partition that is assumed to be known in advance. The main proposals in this
direction for prototype identification assume that both resemblance and dissimilarity
measures are based on the Euclidean distance. The semantics of prototypes is also
strongly affected by the choice of the mixing or aggregating function $\Phi(\cdot,\cdot)$. If one
considers only the internal resemblance, the prototypes will be the central elements
of the groups; on the other hand, if one takes into account only the external
dissimilarity, the prototypes will be the most extreme points. The mixing function $\Phi(\cdot,\cdot)$
yields a compromise between these two instances.
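
The trade-off between centrality and extremeness can be illustrated with a small sketch. The specific choices made here are assumptions for illustration only: resemblance is taken as minus the mean Euclidean distance to the other members of the group, dissimilarity as the mean Euclidean distance to the points outside the group, and the mixing function as a weighted sum with weight `w`. Each group is assumed to have at least two members.

```python
import math

def mean_dist(x, points):
    return sum(math.dist(x, y) for y in points) / len(points)

def find_prototypes(groups, w=0.5):
    """Return one prototype per group by maximising
    T = w * R + (1 - w) * D over the members of the group."""
    result = []
    for h, members in enumerate(groups):
        outside = [y for g, grp in enumerate(groups) if g != h for y in grp]

        def typicality(x):
            inside = [y for y in members if y is not x]
            r = -mean_dist(x, inside)    # internal resemblance
            d = mean_dist(x, outside)    # external dissimilarity
            return w * r + (1 - w) * d

        result.append(max(members, key=typicality))
    return result

# Two well-separated groups on the real line.
groups = [[(0.0,), (1.0,), (2.0,)], [(8.0,), (9.0,), (10.0,)]]
central = find_prototypes(groups, w=1.0)   # resemblance only
extreme = find_prototypes(groups, w=0.0)   # dissimilarity only
```

With resemblance only, the prototypes are the middle points of the two groups; with dissimilarity only, they move to the points farthest from the other group.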


4 Categorization by Voronoi tessellation: the wine data-set 


In the conceptual space framework, the categorization problem can be solved by
partitioning the space through the Voronoi tessellation starting from a given set of
prototypes. In our approach, we provided a way to derive prototypes from data [22].
We note that the geometrical properties of our prototypes are congruent with the
conceptual space approach, and we therefore propose to use our data-driven prototypes
for the Voronoi tessellation in order to obtain a categorization. In addition, in
cognitive science it is often assumed that the number of prototypes, and hence of typologies
in the data, is known a priori. However, in any real-world cognitive study things
are completely different, and the true number of typologies must be inferred by studying
the groups in the data. Deciding the number of groups is one of the most
widely addressed problems in cluster analysis, and most likely it has no satisfactory
solution that can be generalized to any category of problem. Dealing with extreme
data points, AA allows us to choose the number of archetypes according to the
behavior of the loss function evaluated at different numbers of archetypes. The loss
function is plotted on a Cartesian coordinate system, where the x-axis represents the
number of archetypes and the y-axis the value of the loss function (decreasing by
definition); the optimal number of archetypes should be revealed by an elbow of the
function (graphically, the point where the loss function begins to run parallel to the x-axis).
However, the presence of multivariate outliers or highly correlated variables could mask the
true number in favor of redundant and unstable solutions. Deeper investigations
based on computationally intensive studies can reveal this kind of situation.

In this section we consider the wine dataset. First presented by Forina et
al. [12], it contains data on 178 wines produced from three different Italian
cultivars (barbera, barolo and grignolino) and described by 13¹ features that refer to
organoleptic and chemical categories. As the three different varieties of wine are
recognized as having their own specific properties, we assume that each of them
represents a category and can be summarized by a prototype.

The first step consists in identifying the archetypes. The package archetypes
[10], available from the CRAN repository, helps to identify the optimal number of
archetypes; here we set the number of archetypes equal to three. We refer the
interested reader to [22] for a more detailed description of the choice of the number
of prototypes. Table 1 reports the three archetypes described by their thirteen
original variables (expressed in their original scales).


      Alc   Mal   Ash    Alk     Mag   Phe   Fla  NFla   Pro   Col   Hue   Dil     Prol
a1  14.19  1.97  2.51  16.45  114.63  3.24  3.40  0.26  2.21  6.68  1.05  3.28  1316.07
a2  13.22  3.78  2.48  22.12   97.47  1.56  0.65  0.49  1.05  7.69  0.63  1.51   621.94
a3  11.79  1.41  2.07  20.04   86.50  2.26  1.97  0.34  1.61  2.15  1.20  3.08   406.40


Table 1 Wine data: Archetypes as first solution. 


The second step consists in grouping the points around the archetypes in the
space defined by the matrix $\Gamma$. In this example a crisp classification has been
used. A fuzzy allocation rule can also be used: it can ensure a
higher degree of “purity” in the groups and (generally) produces an extra group with
respect to the number of archetypes. The three groups, corresponding to the three
archetypes, are visualized in the space spanned by the three columns of $\Gamma$ in
Figure 1.

¹ 1) Alcohol, 2) Malic acid, 3) Ash, 4) Alkalinity of ash, 5) Magnesium, 6) Total phenols, 7)
Flavanoids, 8) Nonflavanoid phenols, 9) Proanthocyanidins, 10) Color intensity, 11) Hue, 12)
OD280/OD315 of diluted wines, 13) Proline.


Fig. 1 Wine data set: groups around the archetypes obtained by the crisp allocation rule. 


The group centroids are identified by the generalized compositional geometric
mean of the group computed from the $\gamma_{ij}$ membership scores. Exploiting the
relationship between the geometric basis spanned by the archetypes and the original
space [1], prototypes can be represented in the original variable space.
It can be shown that in a metric space the representation of properties is obtained
as convex regions. Let us consider the set of prototypes $\mathcal{P} = \{p_1, p_2, \dots, p_K\}$: their
representation in any conceptual space implies (according to the definition of
prototype itself) that they are the central points of the categories they represent. The
distance between any two prototype points $p$ and $p'$ represents their external
dissimilarity. If we assume that any generic point $x_i$ belongs to the same category as the closest
prototype, it can be shown that this rule generates a partitioning of the space into
convex regions [15, 21]. This partition/categorization is given by the Voronoi
tessellation of the conceptual space based only on the prototypes. Note that this approach
also has computational advantages: the tessellation is performed using only a few
points, i.e. the prototypes, and, given the geometric properties of the Voronoi
tessellation, the allocation of new instances to a given category can be done in a very
easy and efficient way.
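
The allocation of a new instance amounts to a nearest-prototype rule under the Euclidean metric, as the following sketch shows. The prototype coordinates are hypothetical two-dimensional points, not the prototypes estimated on the wine data.

```python
import math

def categorize(point, protos):
    """Index of the nearest prototype, i.e. the Voronoi cell of `point`."""
    return min(range(len(protos)),
               key=lambda h: math.dist(point, protos[h]))

# Three hypothetical prototypes in a two-dimensional conceptual space.
protos = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
new_points = [(0.5, 0.2), (3.6, 0.4), (2.1, 2.5)]
labels = [categorize(x, protos) for x in new_points]
```

Each new point requires only as many distance computations as there are prototypes, which is the computational advantage mentioned above.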

The two plots in Figure 2 represent the Voronoi tessellation on the first two
principal components (29% of the total variance). Plot (a) summarizes the entire
categorization process: (i) the triangle vertices represent the three archetypes; (ii)
the blue points (larger than the other points) refer to the prototypes; (iii) the dashed
lines converging in the center define the convex regions associated with the three
categories. It is worth noting that the prototypes appear more internal than
the corresponding archetypes.

Plot (b) on the right-hand side shows the entire tessellation around the three
prototypes, developed with respect to the 178 observed points. It is easy to see that
the categorization given by the tessellation reproduces the three wine typologies well.


(a) Archetypes, prototypes and Voronoi convex regions. (b) Voronoi tessellation.


Fig. 2 Wine data set: Plots a) and b) represent the Voronoi tessellation and the convex geometric 
region on the first two principal components. In Figure (a) the red triangle vertices represent the 
archetypes, the blue points refer to the prototypes and the dashed lines represent the edges of the 
convex regions that correspond to three categories. 


5 Conclusion 


Several alternative cognitive approaches are grounded on the geometric
representation of properties and concepts in convex conceptual spaces. As in the
Voronoi tessellation, our method partitions the conceptual space into convex
regions on the basis of the Euclidean metric. Thus, assuming
that a Euclidean metric is defined on the subspace that is subject to categorization, 
a set of prototypes will generate a unique partitioning of the subspace into convex 
regions by this method. The upshot is that there is an intimate link between pro- 
totype theory and criterion. Furthermore, the metric is an obvious candidate for a 
measure of similarity between different objects. In this way, the Voronoi
tessellation and archetypes categorization provide a constructive geometric answer to how
a similarity measure and a set of prototypes determine a set of categories.
Inference with the Unscented Kalman Filter and
optimization of sigma points 


Inferenza con Unscented Kalman Filter e ottimizzazione 
dei punti sigma 


Michela Eugenia Pasetto, Umberto Noé, Alessandra Luati, Dirk Husmeier 


Abstract We investigate the accuracy of inference in a chaotic dynamical system 
(Duffing oscillator) with the Unscented Kalman Filter and quantify the dependence 
on the sample size and the signal to noise ratio. In order to improve convergence to 
the true parameters in the case of a bad initialization of the algorithm, we optimize 
the location of sigma points with Bayesian optimisation. 

Abstract We study the accuracy of inference in a chaotic dynamical system (Duffing
oscillator) with the Unscented Kalman Filter and quantify its dependence on the
sample size and the signal-to-noise ratio. To improve convergence
to the true parameters when the algorithm is badly initialized, the
position of the sigma points is optimized in a Bayesian fashion.


Key words: Bayesian filtering, Unscented Kalman Filter, Chaotic dynamical sys- 
tem, Bayesian optimisation, Gaussian Process 


1 Introduction 


We analyse the deterministic Duffing process, defined as

$$\frac{dx_{1t}}{dt} = x_{2t}, \qquad \frac{dx_{2t}}{dt} = -\left(c\,x_{2t} + \alpha x_{1t} + \beta x_{1t}^{3}\right), \qquad (1)$$

where $x_{1t}$ and $x_{2t}$ are the position and the velocity, respectively, of the oscillation
at time $t$, $g(x_{1t}) = \alpha x_{1t} + \beta x_{1t}^{3}$ is a restoring force, $\alpha$ is the natural frequency of
the vibration, $\beta$ the mode of the restoring force (hard or soft spring), and $c$ is the
damping term. The Duffing system (1) describes a periodically forced oscillator
with a nonlinear elasticity, and has been widely used in physics, economics and
engineering (Kovacic and Brennan, 2011). A characteristic feature is its chaotic
behaviour, which makes statistical inference challenging. In the present paper we
present an approach based on the Unscented Kalman Filter (UKF).

Fig. 1 UKF estimates for the deterministic Duffing system with SNR=31 and n = 1000. (a) Signal
estimate. (b) Estimate of parameter $\alpha$. (c) Estimate of parameter $\beta$. (d) Estimate of parameter $c$.
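
For readers who want to reproduce trajectories of system (1), a minimal integration sketch follows. It uses a fixed-step fourth-order Runge-Kutta scheme rather than the adaptive MATLAB solver used in the paper, with step size 0.01 and starting values [1, 0]; the parameter values `alpha=1.0`, `beta=1.0` and `c=0.2` are hypothetical, since the true values are not reported here.

```python
def duffing_rhs(state, alpha, beta, c):
    """Right-hand side of system (1): position x1 and velocity x2."""
    x1, x2 = state
    return (x2, -(c * x2 + alpha * x1 + beta * x1 ** 3))

def rk4_step(state, dt, alpha, beta, c):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return (s[0] + h * k[0], s[1] + h * k[1])
    k1 = duffing_rhs(state, alpha, beta, c)
    k2 = duffing_rhs(shifted(state, k1, dt / 2), alpha, beta, c)
    k3 = duffing_rhs(shifted(state, k2, dt / 2), alpha, beta, c)
    k4 = duffing_rhs(shifted(state, k3, dt), alpha, beta, c)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * cc + d)
                 for s, a, b, cc, d in zip(state, k1, k2, k3, k4))

def simulate(n_steps, dt=0.01, start=(1.0, 0.0),
             alpha=1.0, beta=1.0, c=0.2):
    """Return the list of (position, velocity) states along the path."""
    path = [start]
    for _ in range(n_steps):
        path.append(rk4_step(path[-1], dt, alpha, beta, c))
    return path

path = simulate(2000)   # 20 time units with step 0.01
```

Noisy measurements of the kind used for filtering can then be obtained by adding Gaussian noise to the first component of each state.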


2 Methodology 


The UKF algorithm is a non-linear generalization of the Kalman filter which relies on
the unscented transform (Julier and Uhlmann, 2004) to construct a Gaussian
approximation to the filtering distribution. The UKF performs Bayesian
estimation of a state-space model:

$$x_t = f(x_{t-1}) + \epsilon_t, \qquad y_t = h(x_t) + \eta_t, \qquad (2)$$

where $x_t \in \mathbb{R}^M$ is the (hidden) state at time $t$, $y_t$ is the real-valued measurement vector,
$\epsilon_t \sim N(0, \Sigma_\epsilon)$ is the Gaussian system noise and $\eta_t \sim N(0, \Sigma_\eta)$ is the Gaussian
observation noise. The non-linear differentiable functions $f$ and $h$ are, respectively, the
transition and observation models. The UKF passes a deterministically chosen set of
points (sigma points) through $f$ to obtain the predictive distribution $p(x_t \mid y_{1:t-1})$.
Then, the sigma points are transformed using the model $h$ to compute the filtering
distribution $p(x_t \mid y_{1:t})$. As suggested in Sitz et al. (2002), we merge the signal
with the parameter vector $\lambda = [\alpha\ \beta\ c]^T$ in a joint state vector $j_t = [x_t, \lambda_t]^T =
[(f(x_{t-1}, \lambda_{t-1}) + \epsilon_t), \lambda_{t-1}]^T$, and $y_t = h(j_t) + \eta_t$. In our case, the function $f$ of
model (2) is given by the numerical solution of system (1), $h$ is the identity
function, and $\epsilon_t = 0$.
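
The deterministic choice of sigma points can be sketched as follows, assuming the usual scaled unscented transform with three scalar parameters (here called `alpha`, `beta` and `kappa`) and a Cholesky factor as matrix square root. This is a generic illustration, not the code of the toolbox used in the paper, and the mean, covariance and parameter values in the example are hypothetical.

```python
import math

def cholesky(P):
    """Lower-triangular L with L L^T = P for a small SPD matrix."""
    n = len(P)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = P[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sigma_points(mean, cov, alpha, beta, kappa):
    """Return the 2n + 1 sigma points with their mean and covariance weights."""
    n = len(mean)
    lam = alpha ** 2 * (n + kappa) - n
    L = cholesky(cov)
    scale = math.sqrt(n + lam)
    points = [list(mean)]
    for j in range(n):
        col = [scale * L[i][j] for i in range(n)]   # j-th scaled column of L
        points.append([m + c for m, c in zip(mean, col)])
        points.append([m - c for m, c in zip(mean, col)])
    w_mean = [lam / (n + lam)] + [1.0 / (2 * (n + lam))] * (2 * n)
    w_cov = list(w_mean)
    w_cov[0] += 1.0 - alpha ** 2 + beta
    return points, w_mean, w_cov

pts, w_mean, w_cov = sigma_points([1.0, -2.0], [[2.0, 0.6], [0.6, 1.0]],
                                  alpha=1.0, beta=0.0, kappa=1.0)
```

By construction the weighted mean and covariance of the sigma points match the input mean and covariance; the filter propagates these points through the transition and observation models and recomputes the moments from the transformed points.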


3 Simulations 


We simulate system (1) through the ode23 MATLAB function with an integration step
size $\delta t = 0.01$ and starting values $[1, 0]$ for the numerical integration.
Measurements are obtained from the first component, $x_{1t}$, by adding observational noise
$\eta_t \sim N(0, \sigma^2)$ with known variance. The time interval is $t = 1,\dots,20$, and the
presented results are averaged over 10 simulations. The UKF algorithm is performed
with the EKF/UKF toolbox of Hartikainen et al. (2011). To investigate the behaviour
of the Duffing process and the UKF performance, we have simulated several
scenarios, varying the Signal to Noise Ratio, SNR $\in \{30, 10, 1\}$, and the sample size,
$n \in \{1000, 100, 50\}$ (Figures 1-5). To evaluate the impact of initialization, we
considered different offsets (low, medium and high) as starting values for the
parameters. The offsets are sampled randomly from a Gaussian distribution in which the
mean is defined by a percentage deviation from the true parameter values and the
variance is 10% of the mean (Table 1).

Fig. 2 UKF estimates for the deterministic Duffing system with SNR=10 and n = 1000. (a) Signal
estimate. (b) Estimate of parameter $\alpha$. (c) Estimate of parameter $\beta$. (d) Estimate of parameter $c$.

Fig. 3 UKF estimates for the deterministic Duffing system with SNR=1 and n = 1000. (a) Signal
estimate. (b) Estimate of parameter $\alpha$. (c) Estimate of parameter $\beta$. (d) Estimate of parameter $c$.

Fig. 4 UKF estimates for the deterministic Duffing system with SNR=10 and n = 100. (a) Signal
estimate. (b) Estimate of parameter $\alpha$. (c) Estimate of parameter $\beta$. (d) Estimate of parameter $c$.

Fig. 5 UKF estimates for the deterministic Duffing system with SNR=10 and n = 50. (a) Signal
estimate. (b) Estimate of parameter $\alpha$. (c) Estimate of parameter $\beta$. (d) Estimate of parameter $c$.


Table 1 Impact of the initialization for the deterministic Duffing system for different offsets (as
percentage of the true parameter values) in terms of Euclidean norm prior to and post
inference. Default sigma points.

              α               β               c
         Prior  Post     Prior  Post     Prior  Post
100%      1.05  0.49      2.37  1.69      0.08  0.02
250%      2.71  0.22      5.16  9.24      0.22  0.02
400%      3.83  1.94      8.27  9.10      0.54  1.79


4 Optimization of sigma points 


The location of the sigma points in the UKF algorithm is parametrised by three scalar
values $\theta = (\alpha_{ukf}, \beta_{ukf}, \kappa_{ukf})$. These parameters are set heuristically by the algorithm,
and the default values for model (1) are $\alpha_{ukf} = 1$, $\beta_{ukf} = 0$, $\kappa_{ukf} = 3$. However,
the positioning of the sigma points affects the overall inference performance of the
UKF method and its convergence. We optimize the location of the sigma points by
minimising the loss function $L(\theta)$ using Bayesian optimisation, in order to improve the
convergence of the UKF to the true differential equation parameters even in the case
of a bad initialization. The Bayesian optimisation algorithm iteratively maintains a
statistical emulator of the objective function $L$ (Figure 6) and chooses the next
“best” point $\theta = (\alpha_{ukf}, \beta_{ukf}, \kappa_{ukf})$ by maximising an auxiliary acquisition function
derived from the current emulator. The emulator of the loss function $L$ is given by a
Gaussian Process (GP) with constant mean function and Matérn $\nu = 5/2$ covariance
function, which leads to twice differentiable sample paths. The GP parameters are
estimated by maximising the log marginal likelihood. Given the GP at the current
iteration, $\hat{L} \sim GP(m, s)$, the acquisition function used in this study is given by a weighted
version of the Expected Improvement (EI):


where $\Phi$ and $\phi$ denote the cdf and pdf of a $N(0,1)$ random variable. We follow
the approach discussed in Noé et al. (2017), which weights the EI acquisition (3)
with the probability of a successful objective function evaluation. This allows us
to account for failures in the evaluation of $L$ due to matrix singularities, and still
optimise it when standard optimization algorithms would fail. The weighted EI
acquisition function balances exploitation, where the GP mean $m(\theta)$ predicts a low
function value, and exploration, where the GP predicts high uncertainty $s^2(\theta)$. The
acquisition function is optimized using the Nelder-Mead algorithm on the 10 start
points having lowest acquisition function value among 10* random starting points.
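
The idea of weighting the Expected Improvement by a success probability can be illustrated with a minimal sketch. It uses the textbook EI for minimisation and multiplies it by a probability of successful evaluation; the function name `weighted_ei`, the argument `p_success` and this exact functional form are assumptions made for illustration, and are not taken from equation (3) or from Noé et al. (2017).

```python
import math


def normal_pdf(z):
    """Density of a standard normal random variable."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Distribution function of a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def weighted_ei(m, s, best, p_success):
    """Textbook EI for minimisation, weighted by a success probability.

    m, s      : GP predictive mean and standard deviation at a candidate point
    best      : lowest loss value observed so far
    p_success : assumed probability that the loss can be evaluated without failure
    """
    improvement = best - m
    if s <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        ei = max(improvement, 0.0)
    else:
        z = improvement / s
        # First term rewards a low predicted mean (exploitation),
        # second term rewards high predictive uncertainty (exploration).
        ei = improvement * normal_cdf(z) + s * normal_pdf(z)
    return p_success * ei
```

A candidate whose evaluation is expected to fail (for instance because of a singular matrix) receives a small `p_success` and is therefore down-weighted, even if its unweighted EI is large.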




Table 2 Euclidean norm before (Prior) and after (Post) inference with Bayesian optimisation.

Offset   $\alpha$        $\beta$         $c$
         Prior   Post    Prior   Post    Prior   Post
100%     1.05    0.37    2.37    1.25    0.08    0.02
250%     2.71    1.82    5.16    6.05    0.22    0.06
400%     3.83    0.19    8.27    5.46    0.54    0.09


Table 3 Standardized Euclidean norm in the parameter space: comparison between the default
algorithm parameters and Bayesian optimisation (BO).

Offset   Default         BO
         Prior   Post    Prior   Post
100%     1.76    1.01    1.76    0.76
250%     4.33    4.63    4.33    3.59
400%     7.83    18.62   7.83    2.90


Fig. 6 Log-likelihood of the measurements for different offsets. (a) High offset. (b) Medium
offset. (c) Low offset. The white spaces are due to numerical instability when inverting the
Kalman gain matrix.


5 Results 


Figures 1-5 show that the UKF successfully learns the parameters from the noisy
data, and that at the end of the filtering phase the true parameters always lie within
the predicted standard error around the estimate. This suggests that Bayesian filtering
offers a successful paradigm for inference in chaotic dynamical systems. The
prediction uncertainty depends on the sample size $n$ and the level of noise, quantified
by the SNR. As one would expect, the uncertainty increases with decreasing $n$
and decreasing SNR, i.e. as information in the data is lost, and our study allows a
quantification of this trend. The increase in uncertainty particularly affects the
parameter $\beta$, which is associated with the nonlinear term and the source of the chaotic
behaviour. Tables 1-3 show improvement in the convergence to the true values, measured
in terms of the Euclidean distance in parameter space, due to the optimization
of sigma points through Bayesian optimisation. This distance is consistently reduced
with the optimized sigma points, suggesting that, even in the case of a bad initialization
of the UKF algorithm, optimizing the sigma points will improve the inference
results.

Fig. 7 UKF estimates for the deterministic Duffing system with default sigma points. (a) Signal
estimate. (b) Estimate of parameter $\alpha$. (c) Estimate of parameter $\beta$. (d) Estimate of
parameter $c$.

Fig. 8 UKF estimates for the deterministic Duffing system with optimized sigma points in the case
of high offset. (a) Signal estimate. (b) Estimate of parameter $\alpha$. (c) Estimate of parameter
$\beta$. (d) Estimate of parameter $c$.


Pairwise Likelihood Inference for
Parameter-Driven Models 


Inferenza basata sulla verosimiglianza a coppie per 
modelli ‘parameter-driven’ 


Xanthi Pedeli and Cristiano Varin 


Abstract This paper discusses likelihood-type inference in parameter-driven models
for regression analysis of non-normal data in the presence of serial correlation. Since
the ordinary likelihood function involves an intractable high-dimensional integral,
we consider a pairwise likelihood approach that requires approximating only a limited
set of two-dimensional integrals. Maximization of the pairwise likelihood is carried
out with a pairwise version of the expectation-maximization algorithm. The
methodology is illustrated with surveillance data to evaluate the relationship between
influenza and meningococcal infections. Results are in close agreement with
Bayesian inference based on the integrated nested Laplace approximation.
Abstract Questo articolo discute l’inferenza basata sulla verosimiglianza a cop- 
pie nei modelli ‘parameter-driven’ usati per analisi di regressione con dati non 
normali in presenza di dipendenza temporale. Siccome la verosimiglianza ordi- 
naria è data da un integrale di alta dimensionalità che non ha soluzione in forma 
chiusa, abbiamo considerato un approccio basato sulla verosimiglianza a coppie 
che richiede di approssimare un limitato insieme di integrali bivariati. La massimiz- 
zazione della verosimiglianza a coppie è effettuata tramite una versione a coppie 
dell’algoritmo ‘expectation-maximization’. La metodologia è illustrata con l’analisi 
di dati di sorveglianza epidemiologica per valutare l’associazione fra l’influenza e 
le infezioni da meningococco. I risultati in questa applicazione sono molto simili 
a quelli ottenuti in ambito Bayesiano usando il metodo ‘integrated nested Laplace 
approximation’. 


Key words: Expectation-maximization algorithm; Pairwise likelihood; Parameter-driven
models; Surveillance; Time series of counts.


1 Introduction


Parameter-driven models [1] are frequently used for regression analysis of non-normal
data in the presence of serial correlation. This class of models assumes that the time
series observations $Y_t$ are independent random variables conditionally on a latent
process $U_t$ designed to describe the serial correlation. Let $p(y_t \mid u_t; \theta)$ be the
conditional density or probability function of $Y_t$ given $\{U_t = u_t\}$ and
$p(u_1, \ldots, u_n; \theta)$ the joint density function of the latent variables, commonly
assumed to be multivariate normal. The distributions depend on a $p$-dimensional parameter
$\theta$. The likelihood function for $\theta$ is the $n$-dimensional integral obtained by
integrating out the latent variables

\[
L(\theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{t=1}^{n} p(y_t \mid u_t; \theta) \, p(u_1, \ldots, u_n; \theta) \, du_1 \cdots du_n. \tag{1}
\]


A variety of simulation and non-simulation methods have been proposed for ap- 
proximate inference in parameter-driven models, see [2] for a review. 

In this paper, we study inference in parameter-driven models through the pairwise
likelihood of order $d$ [3], constructed by pooling together bivariate marginal
distributions

\[
L^{(d)}(\theta) = \prod_{t=d+1}^{n} \prod_{i=1}^{d} p(y_t, y_{t-i}; \theta), \tag{2}
\]

where each component is a two-dimensional integral

\[
p(y_t, y_{t-i}; \theta) = \int \int p(y_t \mid u_t; \theta) \, p(y_{t-i} \mid u_{t-i}; \theta) \, p(u_t, u_{t-i}; \theta) \, du_t \, du_{t-i}.
\]

The merit of the pairwise likelihood is to replace the intractable $n$-dimensional
integral of the full likelihood (1) with a set of bivariate integrals. Under model
conditions, the maximum pairwise likelihood estimator $\hat{\theta}^{(d)}$ of order $d$ is
consistent and asymptotically normally distributed,

\[
G(\theta)^{1/2} (\hat{\theta}^{(d)} - \theta) \xrightarrow{d} \mathrm{MVN}(0, I_p),
\]

where $\mathrm{MVN}(0, I_p)$ is a $p$-dimensional multivariate standard normal distribution and
$G(\theta)$ is the Godambe information [3],

\[
G(\theta) = H(\theta) J(\theta)^{-1} H(\theta),
\]

with $H(\theta) = -E\{\nabla^2 \ell^{(d)}(\theta)\}$ and $J(\theta) = \mathrm{var}\{\nabla \ell^{(d)}(\theta)\}$,
where $\ell^{(d)}(\theta) = \log L^{(d)}(\theta)$ is the log-pairwise likelihood.
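
As a small numerical illustration of the sandwich form of the Godambe information, the sketch below computes $G = H J^{-1} H$ and the implied standard errors. The function names are hypothetical, and the sketch assumes that estimates of $H$ and $J$ (already scaled for the full sample) are available as symmetric matrices; how they are estimated is not addressed here.

```python
import numpy as np


def godambe_information(H, J):
    """Return G = H J^{-1} H for a sensitivity matrix H and a variability matrix J."""
    H = np.asarray(H, dtype=float)
    J = np.asarray(J, dtype=float)
    # Solve J X = H instead of forming the explicit inverse of J.
    return H @ np.linalg.solve(J, H)


def sandwich_standard_errors(H, J):
    """Standard errors as square roots of the diagonal of G^{-1}."""
    G = godambe_information(H, J)
    return np.sqrt(np.diag(np.linalg.inv(G)))
```

When $H = J$, as for a correctly specified full likelihood, the expression reduces to $G = H$, the usual Fisher information.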

The maximum pairwise likelihood estimator can be computed using a pairwise
version of the expectation-maximization algorithm. The algorithm constructs a
sequence of estimates $\theta^{(h)}$ through maximization of the expected pairwise complete
log-likelihood,

\[
Q(\theta \mid \theta^{(h)}) = \sum_{t=d+1}^{n} \sum_{i=1}^{d} \int \int \log p(y_t, y_{t-i}, u_t, u_{t-i}; \theta) \, p(u_t, u_{t-i} \mid y_t, y_{t-i}; \theta^{(h)}) \, du_t \, du_{t-i}.
\]

The bivariate integrals involved in the expected pairwise complete log-likelihood are
approximated with a double Gauss-Hermite quadrature. The pairwise expectation-maximization
algorithm is particularly convenient for inference in parameter-driven
models because $\theta^{(h)}$ is partially available in closed form.
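
To make the double Gauss-Hermite quadrature concrete, the following sketch approximates one bivariate component $p(y_t, y_{t-i}; \theta)$. It assumes, for illustration only, Poisson conditional distributions with log-means $\eta + u$ and a zero-mean bivariate normal pair of latent variables with common standard deviation `tau` and correlation `rho`; the function names and the number of nodes are hypothetical choices.

```python
from math import lgamma

import numpy as np
from numpy.polynomial.hermite import hermgauss


def poisson_pmf(y, mean):
    """Poisson probability of the count y, evaluated for an array of means."""
    return np.exp(y * np.log(mean) - mean - lgamma(y + 1))


def pairwise_density(y1, y2, eta1, eta2, tau, rho, n_nodes=30):
    """Approximate p(y1, y2) by a double Gauss-Hermite quadrature.

    Assumed model: Y_j | U_j ~ Poisson(exp(eta_j + U_j)), with (U_1, U_2)
    bivariate normal, zero means, standard deviation tau, correlation rho.
    """
    x, w = hermgauss(n_nodes)
    z = np.sqrt(2.0) * x       # nodes rescaled for a standard normal variable
    w = w / np.sqrt(np.pi)     # weights rescaled so that they sum to one
    z1, z2 = np.meshgrid(z, z, indexing="ij")
    w12 = np.outer(w, w)
    # Map independent standard normals to the correlated latent pair.
    u1 = tau * z1
    u2 = tau * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
    integrand = poisson_pmf(y1, np.exp(eta1 + u1)) * poisson_pmf(y2, np.exp(eta2 + u2))
    return float(np.sum(w12 * integrand))
```

The log-pairwise likelihood of order $d$ is then obtained by summing the logarithms of such terms over $t = d+1, \ldots, n$ and $i = 1, \ldots, d$.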


2 Application 


The analysis of time series of infectious disease counts is of particular interest owing 
to the special features that they present, which include long-term trends, seasonality 
and occasional outbreaks. Apart from these features, special links between several 
infectious diseases form an additional source of information for biosurveillance pur- 
poses. A characteristic example is the association between influenza infection and 
meningococcal disease with the former being a well-known risk factor for the latter, 
see for example [4]. 

Figure 1 displays the weekly counts of meningococcal disease and influenza infection
cases in Germany for the years 2001-2006. The data come from the German
national surveillance system for notifiable diseases, administered by the Robert
Koch Institute, and consist of $n = 312$ observations. It is clear in Figure 1 that both
diseases display a seasonal pattern with evident outbreaks during the winters of 2003
and 2005, while there is no indication of trend.

Fig. 1 Weekly counts of influenza infection (top panel) and meningococcal disease (bottom panel)
cases in Germany for the period 2001-2006.

The standard analysis of this type of surveillance data assumes that the meningococcal
disease counts $Y_t$ are marginally distributed as independent Poisson random
variables with mean $\exp(\eta_t)$, specified so as to account for annual seasonality and
the potential association with influenza infection,

\[
\eta_t = \beta_0 + \beta_1 \cos\left(\frac{2\pi t}{52}\right) + \beta_2 \sin\left(\frac{2\pi t}{52}\right) + \beta_3 \log(\mathrm{Flu}_t + 1). \tag{3}
\]

The transformation $\log(x + 1)$ is used to reduce the right skewness of the influenza
infection data, the shift by one making the values strictly positive before the logarithm
is taken.

To account for the presence of serial correlation, we use the pairwise
expectation-maximization algorithm to fit a parameter-driven model that assumes
that $Y_t$ follows a Poisson distribution with mean $\exp(\eta_t + U_t)$, where the
linear predictor $\eta_t$ is specified as in (3) and $U_t$ follows the first-order
autoregressive model

\[
U_t = \phi U_{t-1} + \sigma \epsilon_t, \qquad \epsilon_t \sim N(0, 1).
\]


The order d of the pairwise likelihood to be maximized is chosen among a reason- 
able set of candidate orders based on the criterion of efficiency. 
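
The structure of the parameter-driven model can be made tangible by simulating from it. The sketch below draws a latent first-order autoregressive process and then Poisson counts conditionally on it. The function name is hypothetical, and starting the latent process from its stationary distribution is an assumption of this sketch rather than a detail given in the paper.

```python
import numpy as np


def simulate_parameter_driven_poisson(eta, phi, sigma, rng):
    """Simulate Y_t ~ Poisson(exp(eta_t + U_t)) with a latent AR(1) process U_t.

    eta   : array of linear predictors eta_t
    phi   : autoregressive coefficient (|phi| < 1 assumed)
    sigma : standard deviation of the innovations
    rng   : numpy random Generator
    """
    eta = np.asarray(eta, dtype=float)
    n = len(eta)
    u = np.empty(n)
    # Assumption: the latent process starts from its stationary distribution.
    u[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))
    for t in range(1, n):
        u[t] = phi * u[t - 1] + sigma * rng.normal()
    # Counts are conditionally independent given the latent process.
    y = rng.poisson(np.exp(eta + u))
    return y, u
```

For instance, `simulate_parameter_driven_poisson(np.full(312, 2.1), 0.7, 0.17, np.random.default_rng(0))` generates a series of the same length as the surveillance data, with a constant linear predictor chosen here only for illustration.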

The parameter estimates and the corresponding standard errors obtained with
the standard analysis and the parameter-driven model are displayed in Table 1.
Inference for the parameter-driven model is based on a pairwise likelihood of order
$d = 7$, which gives the highest efficiency among all orders from 1 to 10. For comparison
purposes, we also consider Bayesian inference using the integrated nested
Laplace approximation (INLA) [5], as implemented in the R [6] package R-INLA
(www.r-inla.org).

Table 1 Parameter estimates (standard errors) for models fitted to the weekly counts of
meningococcal infections in Germany for the period 2001-2006. PL($d = 7$) stands for pairwise
likelihood of order $d = 7$.

             Poisson GLM    Parameter-driven
                            PL($d = 7$)    INLA
$\beta_0$    2.10 (0.04)    2.12 (0.04)    2.11 (0.07)
$\beta_1$    0.14 (0.03)    0.16 (0.02)    0.15 (0.05)
$\beta_2$    0.24 (0.04)    0.27 (0.02)    0.27 (0.07)
$\beta_3$    0.06 (0.02)    0.05 (0.02)    0.05 (0.02)
$\sigma^2$   -              0.03 (0.01)    0.02 (0.01)
$\phi$       -              0.70 (0.09)    0.68 (0.12)

Results of the standard analysis indicate significant seasonality and a significant risk
effect of influenza infection for meningococcal disease. The fitted parameter-driven model
confirms these findings and provides further evidence of significant autocorrelation.
Pairwise likelihood and INLA provide similar parameter estimates, but the standard
errors of the maximum pairwise likelihood estimates are appreciably smaller than those
based on INLA, thus suggesting a higher precision for the proposed method in this
particular application.


Felicia Pelagalli, Francesca Greco and Enrico De Santis


Abstract In this paper we present an investigation of the emotional content conveyed
by words in online conversations captured on Twitter. A multivariate technique
applied to the co-occurrence of words, together with Correspondence Analysis, is
adopted in order to find clusters of meaningful words, detecting emotional categories
that provide meaning to everyday events. Specifically, given the current historical
period, in which the European Union has to gain the trust of its citizens, a corpus of 155000
tweets selected through the Italian keywords “Europa” and “EU” is analyzed. Results
show clearly how the textual content is structured according to the different
emotional expressions.

Abstract In questo articolo è presentata un’analisi testuale che esplora il contenuto 
emozionale delle parole nelle conversazioni su Twitter. È stata adottata una tecnica 
di analisi multivariata applicata alla co-occorrenza delle parole assieme all’analisi 
delle corrispondenze al fine di raggruppare le parole in cluster di significato e in- 
dividuare le categorie e le emozioni che danno senso agli eventi — ossia, i signifi- 
cati attribuiti agli eventi dagli attori partecipanti a un determinato contesto. Dato 
il particolare periodo storico in cui versa l’ Unione Europea, che si trova a dover 
guadagnare la fiducia dei propri cittadini, è stato preparato ed analizzato un cor- 
pus di 155000 tweet selezionati attraverso le keyword “Europa” ed “EU”. I risultati 
mostrano chiaramente come il contenuto testuale è strutturato secondo le differenti 
espressioni emozionali del fenomeno. 

Key words: Text Mining, Social Data Mining, Multivariate Analysis, Correspondence
Analysis, Clustering.


1 Introduction 


With the spread of social networks and micro-blogging platforms, statistical method- 
ologies boosted with machine learning techniques find their natural habitat in the sea 
of available online data. In fact, related techniques enable us to perceive the feel- 
ing that runs through the network. An overwhelming quantity of conversations are 
exchanged, mostly through words in written form. While on the one hand it is possible
to grasp the opinions underlying the online social exchanges, on the other hand
it is clearly of interest to have a measure of the emotional significance that gives
meaning to social phenomena. Now more than ever, this knowledge can help institutions
and community managers to understand people's needs and problems. It is the
emotion that drives us in making relation with the objects of a given context on 
the basis of affective symbolizations and social representations. Hence, in convey- 
ing emotions, words show the functioning structure of the mind-brain, according 
to a dual logic [1]: i) the asymmetrical conscious thought which allows entering in 
a relationship with a context or event; ii) the symmetrical emotional thinking that 
the context or the events immediately arouses within us. Thus, the content analy- 
sis of conversations has to catch and externalize the emotional “density” conveyed 
by words or chains of words, through suitable knowledge models substantiated by 
statistical techniques, such as the multivariate analysis. In fact, the latter, as an un- 
supervised technique, can find recurrences, relations between nodes of a network 
or can help grouping words in meaningful clusters, detecting emotional categories 
that provide meaning to everyday events. According to this framework, the linguis- 
tic communication can be interpreted not only on the basis of its semantic elements 
but also through the emotional framework that yields value to a given text. This 
context fits with the co-occurrence analysis of words, used as the first step of our 
investigation, to find associative links among words. In this study we analyze online 
conversations trying to discover how they are organized within the current social 
context and upon a given object represented by a set of keywords. Specifically, the
corpus consists of 155000 tweets gathered in the period from January
11, 2017, to February 11, 2017, through the Twitter API, filtering the stream by the
Italian keywords “Europa” and “UE”. The corpus is analyzed through a pipeline of
statistical and learning techniques briefly described in the next section. In
order to obtain a thematic analysis based on the co-occurrence of lexical units in
the corpus at hand, the corpus is mapped into the Vector Space Model (VSM) [2].
The k-means algorithm is then adopted to obtain a suitable partition,
using the cosine dissimilarity measure between word vectors. Finally, the Boolean
contingency matrix, describing the membership of documents in the retrieved clusters, is
analyzed with the well-known Correspondence Analysis (CA) technique.

The current paper is organized as follows. In Sec. 2 we provide a brief summary
of the adopted methodology, while in Sec. 3 the main results are discussed. Finally,
conclusions are drawn in Sec. 4.


2 Material and methods 


To carry out the proposed investigation, the data are cleaned and pre-processed. In
particular, lemmas are used as main categories instead of raw words. Subsequently,
the most common words and the very rare words are filtered out. Lemmatization
and filtering yield a more compact VSM and also reduce the sparsity of
the model. We note that in the current section the formal terms “document” and
“context” are interchangeable, as are “term”, “word” and “lexical unit”.

[Fig. 1 diagram stages: SN Corpus, Contexts, Clustering, Corr. Anal., Emotional fram...]

Fig. 1 Schematic diagram of the adopted methodology for measuring the emotional structures
underlying the online conversations.


Following the diagram of Fig. 1, the analysis presented is centered on the VSM [2], a particular
vector or distributional model of meaning. VSM is based on a co-occurrence matrix, 
i.e. the word-document matrix, that is a way of representing how often words co- 
occur. From a methodological point of view the VSM embeds information retained 
within a corpus in a vector space representation, substantiating the distributional 
hypothesis according to which words that occur in similar contexts tend to have 
similar meanings. Let us define the term-document matrix $X = [d_1, d_2, \ldots, d_D]$, where
the content of each document vector $d_j = [w_1, w_2, \ldots, w_V]$ is represented as a vector
in the term space of dimension $V$, which is usually the dimension of the vocabulary. A
standard weighting scheme, used in the current work for $w_i$, is tf-idf (term
frequency-inverse document frequency) [3], which assigns higher weights to terms or
words that are frequent in the current $j$-th document but rare overall in the collection.

In order to measure the similarity between two documents $d_p$ and $d_q$, enabling the
cluster analysis, a well-suited similarity measure is used. It is the cosine similarity,
that is

\[
\mathrm{sim}(d_p, d_q) = \cos(d_p, d_q) = \frac{d_p^T d_q}{\|d_p\| \, \|d_q\|}.
\]
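
The weighting and the similarity measure can be sketched in a few lines. The code below uses one common tf-idf variant (relative term frequency times the logarithm of the inverse document frequency); the paper does not state which variant was used, so this choice and the function names are assumptions made for illustration.

```python
import numpy as np


def tfidf(counts):
    """tf-idf weights for a V x D count matrix (terms in rows, documents in columns)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    # Relative frequency of each term within its document.
    tf = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1.0)
    # Number of documents containing each term.
    df = np.count_nonzero(counts, axis=1)
    idf = np.log(n_docs / np.maximum(df, 1))
    return tf * idf[:, None]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is null."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

A term that occurs in every document receives weight zero under this variant, which reflects the idea that such a term does not help to discriminate between documents.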

The k-means algorithm is a partitional clustering algorithm [4, 5] based on a
squared error optimization approach. Specifically, given a set of objects (word
vectors) $X = \{d_j\}$, $d_j \in \mathbb{R}^V$, where $V$ is the dimension of the data vectors, it
finds a suitable partition $P = \{C_1, C_2, \ldots, C_k\}$ so that the sum of the squared distances
between the objects in each cluster and the respective representative element is minimized:

\[
\min_{P} \sum_{i=1}^{k} \sum_{d_j \in C_i} \| d_j - c_i \|^2, \tag{1}
\]

where $c_i$ is the representative of the $i$-th cluster $C_i$. Since the problem is
NP-hard, a complete analytical solution is not known, and k-means,
as a greedy algorithm, can only converge to a local minimum.
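
The greedy nature of the algorithm is visible in a minimal sketch of the standard Lloyd iterations, which alternate between assigning each object to its nearest representative and recomputing the representatives as cluster means. This is a generic illustration with squared Euclidean distances and random initial representatives, not the authors' implementation; applying it to vectors normalised to unit length is one way to relate it to the cosine dissimilarity, since for unit vectors the squared Euclidean distance equals twice one minus the cosine similarity.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm on the rows of X; returns (labels, centroids, sse)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial representatives: k distinct objects chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: nearest representative in squared Euclidean distance.
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no object changed cluster: a local minimum has been reached
        labels = new_labels
        # Update step: each representative becomes the mean of its cluster.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old representative if a cluster empties
                centroids[i] = members.mean(axis=0)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse
```

Because the result depends on the initial representatives, it is common practice to run the algorithm from several seeds and keep the partition with the lowest sum of squared errors.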

CA is a statistical method useful for data visualization that is applicable to
cross-tabular data such as counts, compositions or any ratio-scale data. In this work, it
is performed on the Boolean contingency matrix describing the partition $P$ [6]. Let
$P$ denote a $q_r \times q_c$ data matrix with non-negative elements that sum up to 1, i.e.
$1_{q_r}^T P 1_{q_c} = 1$, where in general $1_q$ is a $q$-dimensional vector of ones and $T$ is the
transpose operator. The CA is formulated as the following least-squares problem:

\[
\min_{A, B} \left\| \tilde{P} - D_r^{1/2} A B^T D_c^{1/2} \right\|^2, \tag{2}
\]

where $\tilde{P} = D_r^{-1/2} (P - r c^T) D_c^{-1/2}$, $r = P 1_{q_c}$, $c = P^T 1_{q_r}$, and $D_r$ and
$D_c$ are the corresponding diagonal matrices. The row and column coordinate matrices $A$
and $B$ are of rank $k$, that is the dimensionality of the approximation. By imposing
$B^T D_c B = I_k$, it is possible to obtain a solution through the well-known Singular Value
Decomposition $\tilde{P} = U \Lambda V^T$, where $\Lambda$ is a diagonal matrix with the singular
values in descending order on the leading diagonal and $U$ and $V$ are orthonormal matrices.
A least-squares approximation of $\tilde{P}$ is obtained by selecting the first $k$ columns of
$U$ and $V$ and the corresponding singular values in $\Lambda$. Finally, the coordinate matrices
are $A = D_r^{-1/2} U \Lambda$ and $B = D_c^{-1/2} V$, so that $A^T D_r A = \Lambda^2$. Given the
coordinate matrices, the row coordinates are referred to as principal coordinates whereas the
column coordinates are standard coordinates. The two sets of coordinates are also known
as a biplot, and the inner product $D_r^{1/2} A B^T D_c^{1/2}$ in (2) approximates the data. If the
matrix $P$ constitutes a contingency table, $\tilde{P}$ is the matrix of standardized residuals,
i.e. the matrix of standardized deviations from the independence model. Hence, a
low-dimensional approximation of these standardized residuals is given by the biplot
coordinates in $A$ and $B$. In other words, it can be shown that this biplot will
approximate, by Euclidean distances on the plot, chi-square distances in $P$. The chi-square
distance is mathematically the Euclidean distance inversely weighted by the
marginal totals.
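
The computation described above can be sketched with a singular value decomposition of the matrix of standardized residuals. The function name is hypothetical, and the sketch assumes a table of non-negative counts with strictly positive row and column totals.

```python
import numpy as np


def correspondence_analysis(N, k=2):
    """Correspondence analysis of a table N via the SVD of standardized residuals.

    Returns the principal row coordinates A, the standard column coordinates B
    and the first k singular values.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()               # correspondence matrix, entries sum to one
    r = P.sum(axis=1)             # row masses
    c = P.sum(axis=0)             # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    U, sv, V = U[:, :k], sv[:k], Vt[:k].T
    A = (U * sv) / np.sqrt(r)[:, None]   # A = D_r^{-1/2} U Lambda
    B = V / np.sqrt(c)[:, None]          # B = D_c^{-1/2} V
    return A, B, sv
```

The squared singular values are the principal inertias (eigenvalues), so their shares of the total give the percentage of variance explained by each factor.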


3 Results 


As concerns the cluster analysis, the cardinality $k$ of the partition $P$ is set to 5.
Tab. 1 reports the explained variance for each principal component; the components are
hereinafter named “factors”. Fig. 2 shows the emotional map of Europe emerging from the
Italian tweets, i.e. how the discovered clusters are placed
in the factorial space, whereas Tab. 2 reports the factors-clusters matrix that
summarizes our main findings. The emerging map shows on the horizontal plane a
sharp contrast between the “political power” and the “populist protest”. The cluster
of words $C_1$ sees the chill and sooty European institutional places that are perceived
as a remote center of power in which citizens definitely do not recognize themselves.
The theme is the election of Antonio Tajani as president of the European
Parliament and Pittella's defeat: words of congratulation, but also disappointment and
irritation from those who do not feel represented (dividere, urtare, sensibilita, impera).
On the opposite side there is a strong sense of helplessness regarding the big problems,
such as immigrants and the economic crisis. $C_3$ is characterized by the EU plan
proposed in order to stop the sea blockade in front of Libyan territories. There are
also tweets in which the Italian Ministry of Economy is perceived as “unable”, while the
former prime minister Matteo Renzi together with Angelino Alfano (current Italian
foreign minister) are considered “hypocrites”. Another emerging contrast, on the
vertical plane, is between the “success of the economic power” and “people problems”. From
it emerges a two-speed Europe and the “economic power” represented by Germany,
with the chancellor Angela Merkel, and by the president of the European Central
Bank, Mario Draghi. It is a strong power (velocità, vincere) that cohabits with/forgets the
human tragedies (permettere, vergognarsi). On the opposite side, $C_5$ refers to the
need for funds for places hit by the earthquake. Furthermore, it clearly shows
the rise of new political movements, such as the one led by Marine Le Pen
in France, evidencing tension, betrayal, isolation and risks for Europe. Finally, in
$C_4$ (in a middle position on the map) we find the ambivalence fear/anguish related
to the dichotomy opening-closing, where closing seems to prevail together with the
fantasy of closing themselves off in localism to avoid chaos. This is a cluster
full of fears that undermine Altiero Spinelli's project for a united Europe. $C_4$ is
close to the origin of the axes on the factorial map; in fact, it contains basic emotions that
seem to span all the facets of the underlying discourse.


Table 1 Explained variance for each factor.

Ind   Eigenvalues   %       Cumul. %
1     0.1538        29.54   29.5438
2     0.1308        25.14   54.683
3     0.1214        23.32   77.9995
4     0.1145        22.00   100


4 Conclusion 


The current paper presents an analysis of a huge corpus of tweets in the Italian language
based on a set of statistical techniques, specifically cluster analysis and Correspondence
Analysis. Unlike current sentiment analysis techniques, the proposed

Table 2 The factor-clusters matrix. For each cluster, the first line gives the label of each
factor and the second line the corresponding coordinate.

Cluster   Factor 1      Factor 2      Factor 3      Factor 4      Description
$C_1$     [labels illegible in the source; only “changing” is legible]
          [illegible]   -0.2485       0.2177        -0.5145       Problems related to the political power unable to drive the changing.
$C_2$     hopes         power         changing      -
          -0.5119       0.6132        0.2352        0.0558        The success is related to the economic power represented by A. Merkel and M. Draghi.
$C_3$     alert         -             changing      injustice
          0.3088        -0.1367       0.2492        0.3808        Alert generated by the changing related to balance of powers.
$C_4$     alert         -             changing      injustice
          0.4653        0.3534        -0.5362       -0.1396       The idea about the union with the social power of foreign countries because of the loss of identity.
$C_5$     hopes         problems      product       -
          -0.5718       -0.4382       -0.4911       0.2101        The European genesis has a cost that causes problems: economic request for help and the rejection to give.

Fig. 2 The map of Europe. Axes: X = Fact 4 (22%); Y = Fact 2 (25.14%).

methodology treats the conversations on social networks as structured
corpora, in which the relationships between words can be described beyond the
evaluative bias (positive/negative or agree/disagree), giving rise to a dense structure of
meaning. Results show clearly how the textual content is structured according to the
different emotional expressions.


Differential Interval-Wise Testing for the
Inferential Analysis of Tongue Profiles


Test Intervallare Differenziale per l'Analisi Inferenziale
di Profili Linguali


Alessia Pini, Lorenzo Spreafico, Simone Vantini and Alessandro Vietti 


Abstract Motivated by the functional data analysis of a data set of tongue profiles, 
we describe in this talk the differential interval-wise testing (D-IWT), i.e., a local 
non-parametric inferential technique for testing the distributional equality of two 
samples of functional data. The described method can impute significant differences 
between the two samples to specific intervals of the domain and to specific orders of 
differentiation. D-IWT based inference provides a highly informative and detailed 
representation of the regions of the tongue where a significant difference between 
manners of articulation is located. 

Abstract Motivati dall’analisi di un data set funzionale di profili linguali, descriv- 
iamo un test intervallare differenziale (differential interval-wise testing o D-IWT), 
una tecnica inferenziale non parametrica locale per testare l'uguaglianza in dis- 
tribuzione di due campioni di dati funzionali. il metodo descritto permette di im- 
putare le eventuali differenze significative tra i due campioni a specifici intervalli 
del dominio e a specifici ordini di differenziazione. L’inferenza ottenuta tramite il 
D-IWT fornisce una rappresentazione altamente informativa e dettagliata delle re- 
gioni della lingua che presentano una differenza significativa tra diversi modi di 
articolazione. 


Key words: Functional data analysis, Derivatives, Non-parametric inference, Local
inference, Articulatory phonetics


1 Introduction


Speech sounds are produced with a mechanism involving three main steps. The 
lungs pump an airflow to vibrate the vocal folds; then the vocal folds generate au- 
dible pulses; finally, pulses are “fine tuned” by the speech organs (e.g., tongue, lips, 
palate). The tongue plays a central role in this final step, and it is involved in the
production of most speech sounds in the world's languages. In fact, it is very flexible,
precise, and fast. This work aims at describing a comprehensive statistical approach
to infer if and how the tongue position and shape change while different sounds are
pronounced by the same speaker. The analysis focuses on the statistical comparison
of tongue profiles corresponding to different manners of articulation of the /R/ sound
in the Tyrolean dialect, i.e., a German dialect spoken in South Tyrol (Italy).

The comparison between the groups of curves can be naturally embedded within
the framework of functional data analysis [12, 4, 6]. The literature dealing with
inference for functional data has pursued different approaches. Most of them are
global, i.e., they provide the analyst with a “simple” rejection or non-rejection of
the null hypothesis (e.g., [5, 2, 3, 6]). Recently, some local methods have been proposed,
providing the analyst with the portions of the domain where the null hypothesis
is rejected or not rejected (e.g., [1, 9, 14, 10]).

In this work we present an overview of a non-parametric local method, the
differential interval-wise testing (D-IWT), described in detail in [11]. The D-IWT
is a technique that tests differences between groups of functional data, jointly
taking into account the curves and their derivatives. Its output is an adjusted p-value
function for each explored derivative order, which can be used to select the intervals of
the domain responsible for the rejection of the null hypothesis.


2 Methodology 


Assume to observe two independent samples of functional data é;;, j = 1,2, i= 
1,...,n; embedded in the Sobolev space H 4(T) of all real-valued squared-integrable 


functions on the domain T with squared-integrable derivatives up to order d > 1 
iid €,, where €, and È, are two independent random elements of H“(T). We aim 
at performing the following family of tests, each focusing on a specific order of 
differentiation k =0,...,d: 


Ho : €, £ €, against Hk : E[D*é |] 4 E[D*E,]. (1) 


The outputs of the D-IWT are: 


e d+1 partial adjusted p-value functions px : T + [0,1], one for each order 
of differentiation, for testing separately the partial hypotheses (1), computed by 
applying the interval-wise testing [10] to every test of the family (1). For every 
k, the p-value Di of the restriction of test (1) on every interval of the domain 
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I CT is computed by means of a non-parametric permutation test [8]. The 
adjusted p-value function Dpr (1) of order k = 0,...,d is defined as: 


Dpi(1) = sup pfe- (2) 


Jt 


• $d+1$ multi-derivative adjusted p-value functions $\tilde{\tilde{p}}_{D^k} : T \to [0,1]$, one for each
order of differentiation, for testing jointly the partial hypotheses (1), computed
by adjusting the $d+1$ partial p-value functions $\tilde{p}_{D^k}(t)$ by means of a closed testing
procedure [7]. In detail, for all possible combinations of differentiation orders
indexed by $\mathbf{k} = \{k_1, k_2, \dots, k_Q\}$ with $\forall q : k_q \in \{0,\dots,d\}$ and $Q \in \{2,\dots,d+1\}$,
a $Q$-variate IWT is performed by means of permutation tests based on Sobolev
norms (or semi-norms) on the corresponding orders of differentiation. The adjusted
p-value functions $\tilde{p}_{D^{\mathbf{k}}}(t)$ are computed according to formula (2) based
on the obtained p-values of multi-derivative tests. Finally, the $d+1$ adjusted
multi-aspect p-value functions $\tilde{\tilde{p}}_{D^k}(t)$ are calculated by taking for each order of
differentiation the point-wise maximum of all adjusted p-value functions $\tilde{p}_{D^{\mathbf{k}}}(t)$
involving that order:


$\tilde{\tilde{p}}_{D^k}(t) = \sup_{\mathbf{k} \ni k} \tilde{p}_{D^{\mathbf{k}}}(t)$. (3)


The p-value functions $\tilde{\tilde{p}}_{D^k}(t)$ can be thresholded at level $\alpha$ to select the intervals
of the domain presenting significant differences between the two populations on the
$k$th order of differentiation. The properties of the D-IWT in terms of control of the
family-wise error rate and consistency are proven in [11].
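
The interval-wise adjustment in formula (2) can be illustrated on discretized curves. The following is a minimal sketch, not the implementation used by the authors: it assumes curves observed on a common grid, uses the integrated squared difference of the sample means as test statistic (an illustrative choice), and takes the maximum over all grid intervals containing each point. The function name `iwt_adjusted_pvalues` and its parameters are hypothetical.

```python
import numpy as np


def iwt_adjusted_pvalues(x1, x2, n_perm=200, seed=0):
    """Interval-wise adjusted p-values for two samples of discretized curves.

    x1, x2: arrays of shape (n1, p) and (n2, p), curves on a common grid.
    Returns an array of length p: for each grid point t, the largest
    permutation p-value over all grid intervals containing t.
    """
    rng = np.random.default_rng(seed)
    data = np.vstack([x1, x2])
    n1, p = x1.shape[0], data.shape[1]

    def pointwise_stat(sample):
        # Squared difference between the two group means at each grid point.
        diff = sample[:n1].mean(axis=0) - sample[n1:].mean(axis=0)
        return diff ** 2

    t_obs = pointwise_stat(data)
    # The same label permutations are reused for every interval.
    t_perm = np.array([pointwise_stat(data[rng.permutation(len(data))])
                       for _ in range(n_perm)])

    # Prefix sums: the statistic on [a, b] is a difference of cumulative sums.
    c_obs = np.concatenate([[0.0], np.cumsum(t_obs)])
    c_perm = np.hstack([np.zeros((n_perm, 1)), np.cumsum(t_perm, axis=1)])

    adjusted = np.zeros(p)
    for a in range(p):
        for b in range(a, p):
            s_obs = c_obs[b + 1] - c_obs[a]
            s_perm = c_perm[:, b + 1] - c_perm[:, a]
            # Permutation p-value of the test restricted to the interval [a, b].
            p_interval = (1 + np.sum(s_perm >= s_obs)) / (n_perm + 1)
            # Every point in [a, b] keeps the largest p-value seen so far.
            adjusted[a:b + 1] = np.maximum(adjusted[a:b + 1], p_interval)
    return adjusted
```

Thresholding the returned vector at a chosen level then selects the grid points that belong only to intervals on which the null hypothesis is rejected. Applying the same function to finite-difference approximations of the derivatives would give one such vector per order of differentiation.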


3 Data Analysis 


To better understand the potential of the D-IWT in practice, we summarize the
results of the analysis of tongue profiles. The aim of the analysis is to test the differences
between five different manners of articulating the uvular /R/:


f FRICATIVE: produced by constricting airflow through a narrow channel at the 
place of articulation. There is contact between tongue and palate. 

a APPROXIMANT: shares with f the way of transmission of the sound, although 
there is no contact between tongue and palate. 

r TRILL: produced by directing air over the tongue so that it vibrates. There is 
contact between tongue and palate. 

t TAP: produced with a single contraction of the muscles so that the tongue is
thrown against the palate. There is contact between tongue and palate.

voc VOCALIZATION: the airstream proceeds along the sides of the tongue but is 
blocked by the tongue from going through the middle of the mouth. There is no 
contact between tongue and palate. 

Fig. 1 Scatter matrix of pairwise differences between the five groups. Groups are identified in
the diagonal. For each pair of groups: the upper-diagonal box indicates the two sample means
(upper part) and the significant intervals at 5% level (lower part); the lower-diagonal box indicates
the three multi-aspect adjusted p-value functions $\tilde{\tilde{p}}_{D^0}(t)$, $\tilde{\tilde{p}}_{D^1}(t)$, and $\tilde{\tilde{p}}_{D^2}(t)$.


Data were collected by ultrasound imaging techniques at the Alpine Laboratory 
of Phonetic Sciences and Phonology of the Free University of Bozen - Bolzano, 
Italy. For a detailed description of the data set, see [13]. The functional data have 
been obtained by a penalized B-spline smoothing of order six. The penalization 
parameter was computed via generalized cross-validation criterion [12]. 

We perform a D-IWT-based analysis of tongue profiles in order to identify the
possible pairwise differences between the five variants in the curves and the first two
orders of differentiation (d = 2). Figure 1 displays the results. For each comparison,
the upper-diagonal plot shows the two sample mean curves and the lower-diagonal
panel shows the three multi-derivative adjusted p-value functions. The three bars
in the lower part of each upper-diagonal plot indicate the intervals with associated
adjusted p-value lower than 5%. The color of each bar matches that of the
corresponding adjusted p-value function.

Inference in terms of D-IWT provides a highly informative and detailed representation
of the regions of the tongue where a significant difference is located. As
expected, we observe more pronounced differences when comparing a or voc (produced
without touching the palate) with f, t, or r (produced by touching the palate),
while there are less pronounced differences when comparing a with voc and when
comparing two variants of the group f, t, and r. For instance, there are no significant
differences between trill /R/ (r) and tap /R/ (t) in any order of differentiation.
Conversely, at $\alpha = 1\%$, approximant /R/ (a) and fricative /R/ (f) (second panel of
the first row) are pointed out as not identically distributed. Fricative /R/ is produced
by touching the palate, while approximant /R/ is produced without touching the
palate. Consistently, we observe significant differences in vertical position between
the two variants, with f reaching higher vertical positions than a. In addition, having
a lower degree of constriction, approximant /R/ has a lower slope in the back part
of the tongue. A more detailed analysis of the results, as well as a simulation study
assessing the performance of the D-IWT, can be found in [11].


Hotelling meets Hilbert: inference on the mean
in functional Hilbert spaces 


Un incontro con Hotelling e Hilbert: inferenza per la 
media in spazi di Hilbert funzionali 


Alessia Pini, Aymeric Stamm, Simone Vantini 


Abstract The talk will focus on the problem of finite-sample null hypothesis significance
testing on the mean element of a random variable that takes values in a
generic separable Hilbert space. For this purpose, we will present a definition of
Hotelling’s $T^2$ statistic that naturally extends to any separable Hilbert space. In detail,
after having recalled the notion of Gelfand-Pettis integral in separable Hilbert
spaces and introduced the definition of random variables in Hilbert spaces, and the
derived concepts of mean and covariance in such spaces, we will present a unified
framework for making inference on the mean element of Hilbert populations based
on Hotelling’s $T^2$ statistic, using a permutation-based testing procedure. We will
then present the theoretical properties of the procedure (i.e., finite-sample exactness
and consistency) and show the explicit form of Hotelling’s $T^2$ statistic in the case of
some well-known spaces used in functional data analysis, like Sobolev and Bayes spaces.
We will finally demonstrate the importance of the space into which one decides to
embed the data by means of simulations and a case study.

Abstract La relazione verterà sul problema della verifica delle ipotesi per campioni
finiti relativamente all’elemento medio di una variabile aleatoria a valori in un
generico spazio di Hilbert separabile. A questo scopo, presenteremo una definizione
del $T^2$ di Hotelling che lo generalizza naturalmente a qualsiasi spazio di Hilbert
separabile. In particolare, dopo aver ricordato la nozione di integrale di Gelfand-Pettis
in un generico spazio di Hilbert separabile e introdotto la definizione di
variabile aleatoria a valori in un generico spazio di Hilbert e contestualmente anche
i concetti derivati di media e covarianza in tali spazi, presenteremo un approccio
per fare inferenza permutazionale sull’elemento medio di popolazioni Hilbertiane
basato sul $T^2$ di Hotelling. Approfondiremo in seguito le proprietà teoriche della
procedura (ovvero l’esattezza per campioni finiti e la consistenza) e mostreremo
la forma esplicita assunta dal $T^2$ di Hotelling nel caso di alcuni spazi frequentemente
utilizzati nell’ambito dell’analisi di dati funzionali: gli spazi di Sobolev e di
Bayes. Concluderemo mostrando (per mezzo di simulazioni e di un caso di studio)
l’importanza dello spazio in cui vengono rappresentati i dati funzionali.


Key words: Functional Data Analysis, Object-oriented Data Analysis, Null hy- 
pothesis testing 


1 Motivation 


Statisticians are increasingly confronted with the analysis of complex data, where
complexity often means that the data are represented by abstract mathematical
constructs, often belonging to some space on which a Hilbert structure is assumed
(i.e., Object-oriented Data Analysis, OODA, [12, 14]). For example, the advent and
development of technologies able to capture high-frequency measurements has
provided the statistician with data that can be viewed as functions, which are the
foundation of functional data analysis (FDA, [17, 7]). While FDA and OODA are
expanding rapidly, the theoretical study of statistical tools for making inference in
such spaces is still a lively area of
methodological investigation ([6, 20, 5, 19, 1, 9, 3, 4, 10, 18, 2, 8, 13, 15]).


2 Talk Outline 


The talk will focus on the inferential problem of constructing a statistical test for
the means of random variables belonging to Hilbert spaces of possibly infinite dimension.
After having recalled the notion of Gelfand-Pettis integral in separable
Hilbert spaces ([11]) and introduced the definition of random variables in Hilbert
spaces, and the derived concepts of mean and covariance in such spaces, starting
from Hotelling’s $T^2$ statistic, widely used in multivariate data analysis for testing
the mean, we will show that Hotelling’s $T^2$ statistic can be coherently defined in
any Hilbert space independently of its dimensionality and the sample size. We
will then discuss the theoretical properties of Hotelling’s $T^2$ statistic
as hereby defined, its explicit form in the case of some well-known spaces used in
functional data analysis like Sobolev spaces ([17, 7]) and Bayes spaces ([?]), and
the development of new null hypothesis significance testing procedures for making
inference on the mean element in Hilbert spaces based on this statistic. We will conclude
by presenting a theoretical and empirical comparison on simulated and real data
with other state-of-the-art procedures and by discussing the importance of the space
into which one decides to embed the data by means of simulations and a case study.
The presented work is fully detailed in [16].
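
For reference, the classical multivariate statistic that the talk takes as its starting point can be written as follows for a sample $x_1,\dots,x_n \in \mathbb{R}^p$ and a hypothesized mean $\mu_0$. The symbols are introduced here for illustration only and are not necessarily the notation of [16].

```latex
T^2 = n\,(\bar{x} - \mu_0)^\top S^{-1} (\bar{x} - \mu_0),
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
S = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top .
```

The sample covariance $S$ has rank at most $n-1$, so it cannot be inverted when $p \geq n$. This is why a definition that holds independently of the dimensionality and of the sample size requires additional care.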


Accounting for measurement error in small area
models: a study on generosity. 


Modelli per piccola area con errore di misurazione: uno 
studio sulla generosità


Silvia Polettini and Serena Arima 


Abstract In this paper we focus on a recently documented effect of economic in- 
equality, namely that higher income individuals tend to be less generous than poorer 
individuals, but only in contexts where macro-level economic inequality is high, or 
is perceived as high. We consider data from the Measuring Morality study, a na- 
tionally representative survey of United States residents, that contains a validated 
behavioural measure of generosity (the dictator game) along with the household 
income of respondents. We fit a small area model to these data with the aim of investigating
the role of economic inequality on generosity in the US. We observe that
model covariates (reported income and Gini index) are subject to measurement error 
and investigate the effect of introducing the measurement error in this model. 
Abstract Il lavoro considera il ruolo della disuguaglianza economica sulla generosità,
a partire da uno studio recente secondo cui gli individui con redditi più
elevati tendono ad essere meno generosi degli individui meno abbienti, ma solo 
in contesti di grande disuguaglianza economica. I dati analizzati provengono dal 
Measuring Morality study, un’indagine effettuata negli USA in cui viene rilevato 
il reddito e una misura validata di generosità (dictator game). Per ogni area di 
residenza è stato anche ricavato l’indice di Gini, come misura di disuguaglianza 
economica. In questo lavoro si stima la generosità mediante un modello per piccole 
aree con reddito e disuguaglianza come variabili ausiliarie. Il modello viene esteso 
al fine di considerare l’errore di misurazione nelle variabili ausiliarie, sia continue 
che discrete. 


Key words: small area estimation, measurement error, misclassification, Bayesian 
inference. 


1 Introduction


There is an increasing interest in understanding the implications of income for behaviour,
in particular generosity toward others. A well-established literature on this topic
has portrayed higher-income individuals as consistently more selfish
than poorer individuals [13]. A different perspective is reported in a recent paper
[6], where the relationship between economic inequality, income, and generosity
is tested. Analysing data from the Measuring Morality study (a nationally representative
survey of United States residents), as well as a follow-up experiment, the
authors identify a previously undocumented effect of economic inequality, namely
that higher income individuals in the US tend to be less generous than poorer individuals,
but only in contexts where macro-level economic inequality is high, or
is perceived as high. The authors comment that the results obtained challenge the
prevailing view in the literature that higher income individuals are necessarily less
generous and conclude that “inequitable resource distributions undermine collective
welfare” and that redistributive policies may “attenuate, or even reverse, the negative
relationship between income and generosity, in turn increasing the generosity
of those individuals who have the most to give”.

The Measuring Morality study data contain a validated behavioural measure of
generosity (the dictator game) along with the household income of respondents;
moreover, Gini indices were available from the American Community Survey. The
authors fit a mixed effects model to these data and find a significant negative
interaction between income and inequality. Using a Bayesian approach, we
consider the same model in a small area context and explore the possibility that both
income and the Gini index are subject to measurement error for different reasons:
income is self-reported and the Gini index is estimated from another survey.
As stressed in the literature, ignoring the measurement error in the covariates may
lead to inconsistent estimates and can severely invalidate inferences.

The paper is organized as follows: in Section 2 we introduce the problem of measurement
error in small area estimation and propose a small area model accounting
for measurement error in covariates. In Section 3 we present and discuss
the results obtained when the model is applied to the generosity data.


2 A measurement error small area model for generosity data 


In this paper, we focus on unit level small area models, within a Bayesian framework.
Unit level small area models relate the unit values of the study variable to
unit-specific auxiliary variables with known area means. See [11] for an up-to-date
review.

Suppose there are $m$ areas and let $N_i$ be the known population size of area $i$.
We denote by $y_{ij}$ the response of the $j$-th unit in the $i$-th area ($i = 1,\dots,m$;
$j = 1,\dots,N_i$). A random sample of size $n_i$ is drawn from the $i$-th area. The goal is
to predict the small area means $\bar{Y}_i = N_i^{-1} \sum_{j=1}^{N_i} y_{ij}$, $i = 1,\dots,m$, based on the available
data. To develop reliable estimates, auxiliary information is introduced as covariates
and usually a mixed effects model is specified as


$y_{ij} = \alpha + \beta w_i + u_i + e_{ij}$, $i = 1,\dots,m$; $j = 1,\dots,N_i$ (1)

with $e_{ij}$ and $u_i$ independent, $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$ and $u_i \overset{iid}{\sim} N(0, \sigma_u^2)$. [8] and [9] were the
first to consider the problem of measurement error in small area models for unit-level
data. They assume that the true, area-level, covariate, $w_i$, is measured with
error as


$S_{ij} = w_i + \eta_{ij}$, $\eta_{ij} \overset{iid}{\sim} N(0, \sigma_\eta^2)$, $i = 1,\dots,m$; $j = 1,\dots,N_i$ (2)


where $e_{ij}$, $u_i$ and $\eta_{ij}$ are taken mutually independent. [8] also assumed that
$w_i \overset{iid}{\sim} N(\mu_w, \sigma_w^2)$, defining the structural measurement error model. They considered both
an empirical Bayes and a hierarchical Bayes approach to derive predictors of small
area means $\theta_i$. [12] extended the approach in [8] including sample information on
the covariate values. [8] also proposed a fully Bayesian approach, by specifying a hierarchical
model, with vague prior distributions for all the model parameters, whose
posterior distributions are estimated via Gibbs sampling. [1, 3] extended the above
approach, proposing to use Jeffreys’ prior on the model parameters. The aforementioned
literature considers the case in which the measurement error only affects
continuous variables, according to the measurement error model of equation (2).

For discrete covariates, measurement error means misclassification. To allow for
auxiliary discrete covariates measured with error, [4] propose to model the misclassification
mechanism through an unknown transition matrix $P$ and estimate all the
unknown parameters in a fully Bayesian framework. Following [4], for each unit in
each area, we consider the following covariates: $t_{ij}$, the vector of $p$ continuous or
discrete covariates measured without error; $w_i$ and $x_{ij}$, respectively a vector of $q$
continuous covariates and $h$ discrete variables (with a total of $K$ categories), both
measured with error. Denote by $s_{ij}$ and $z_{ij}$ the observed values of the latent $w_i$ and
$x_{ij}$, respectively. Without loss of generality, in what follows we assume $h = 1$.
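
The consequence of ignoring the continuous measurement error can be seen in a small simulation. The sketch below is illustrative only: it generates data from a mixed effects model with an area-level covariate measured with error, with arbitrary parameter values chosen here, and compares the least-squares slope on the true covariate with the slope on its error-prone measurement. Under classical measurement error the naive slope is attenuated by the factor $\sigma_w^2 / (\sigma_w^2 + \sigma_\eta^2)$. The function names are hypothetical, and ordinary least squares stands in for the Bayesian fit discussed in the text.

```python
import numpy as np


def simulate_structural_me(m=2000, n=10, alpha=1.0, beta=2.0,
                           sigma_w=1.0, sigma_eta=1.0,
                           sigma_u=0.5, sigma_e=1.0, seed=0):
    """Simulate unit-level responses with an error-prone area-level covariate."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma_w, size=m)      # true area-level covariate
    u = rng.normal(0.0, sigma_u, size=m)      # area random effect
    area = np.repeat(np.arange(m), n)         # area index of each sampled unit
    y = alpha + beta * w[area] + u[area] + rng.normal(0.0, sigma_e, size=m * n)
    s = w[area] + rng.normal(0.0, sigma_eta, size=m * n)  # observed measurement
    return y, s, w[area]


def ols_slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


y, s, w_true = simulate_structural_me()
slope_true = ols_slope(w_true, y)   # close to beta
slope_naive = ols_slope(s, y)       # close to beta * sigma_w^2 / (sigma_w^2 + sigma_eta^2)
```

With equal variances for the covariate and the measurement error, as assumed above, the naive slope is roughly half the true one.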
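Following the notation in [8], the proposed measurement error model can be written
in the usual multi-stage way: for $j = 1,\dots,n_i$, $i = 1,\dots,m$ and for $k, k' = 1,\dots,K$


• Stage 1. $y_{ij} = \theta_{ij} + e_{ij}$, $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$

• Stage 2. $\theta_{ij} = t_{ij}'\delta + w_i'\gamma + \sum_{k=1}^{K} I(x_{ij} = k)\beta_k + u_i$, $u_i \overset{iid}{\sim} N(0, \sigma_u^2)$

• Stage 3. $S_{ij} \mid w_i \overset{iid}{\sim} N(w_i, \Sigma_s = \mathrm{diag}(\sigma_{s1}^2,\dots,\sigma_{sq}^2))$, $w_i \overset{iid}{\sim} N(\mu_w, \Sigma_w = \mathrm{diag}(\sigma_{w1}^2,\dots,\sigma_{wq}^2))$,
$\Pr(Z_{ij} = k \mid X_{ij} = k') = p_{k'k}$, $p_{k'} \sim \mathrm{Dir}(\alpha_{k',1},\dots,\alpha_{k',K})$, $\Pr(X_{ij} = k') = 1/K$

• Stage 4. $\beta$, $\delta$, $\gamma$, $\sigma_e^2$, $\sigma_u^2$, $\sigma_{s1}^2,\dots,\sigma_{sq}^2$ are, loosely speaking, a-priori mutually independent.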


Stage 3 defines the measurement error model for both continuous and discrete
covariates. For the discrete covariates, the misclassification mechanism is specified
according to the $K \times K$ matrix $P$, whose $(k',k)$ element, $p_{k'k}$, denotes the probability
that the observable variable $Z_{ij}$ takes the $k$-th category when the true unobservable
variable $X_{ij}$ takes the $k'$-th category. We also assume that the misclassification probabilities
are the same across subjects and that all the categories have the same prior
probability $1/K$ to occur. Over each row of $P$, we place a Dirichlet $\mathrm{Dir}(\alpha_{k',1},\dots,\alpha_{k',K})$
prior distribution, with known $\alpha_{k',1},\dots,\alpha_{k',K}$. In Stage 4 we assume Normal priors
for $\beta$, $\delta$, and $\gamma$ and inverse gamma distributions for $\sigma_e^2$, $\sigma_u^2$ and $\sigma_s^2$. Hyperparameters
have been chosen to have flat priors. Finally, we fix $\mu_w$ and $(\sigma_{w,1},\dots,\sigma_{w,q})$.
According to the above assumptions, we can estimate the transition matrix $P$ and
the measurement error variance $\sigma_s^2$ jointly with all the other model parameters. As
the posterior distribution cannot be derived analytically in closed form, we obtain
samples from the posterior distribution using Gibbs sampling.
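
One building block of such a sampler can be sketched in isolation. By Dirichlet-multinomial conjugacy, if the true categories were known, each row of the transition matrix would have a Dirichlet full conditional whose parameters are the prior parameters plus the observed transition counts. The code below is a minimal sketch of that single step and of the misclassification mechanism, not the authors' sampler: in the full model the true categories are latent and are themselves updated at every iteration. Function names and the example matrix are hypothetical.

```python
import numpy as np


def simulate_misclassified(x_true, P, rng):
    """Draw observed categories z given true categories x and transition matrix P."""
    cum = np.cumsum(P, axis=1)
    u = rng.random(len(x_true))
    # Index of the first cumulative probability that is not below u.
    z = (u[:, None] > cum[x_true]).sum(axis=1)
    return np.minimum(z, P.shape[1] - 1)  # guard against rounding in the cumsum


def draw_transition_matrix(x_true, z_obs, alpha, rng):
    """One conditional draw of P: row k' ~ Dirichlet(alpha[k'] + counts[k'])."""
    K = alpha.shape[0]
    counts = np.zeros((K, K))
    np.add.at(counts, (x_true, z_obs), 1.0)  # counts[k', k] = #{x = k', z = k}
    draw = np.vstack([rng.dirichlet(alpha[k] + counts[k]) for k in range(K)])
    return draw, counts


rng = np.random.default_rng(0)
# Hypothetical transition matrix with a poorly classified first category.
P_example = np.array([[0.50, 0.25, 0.25],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])
x = rng.integers(0, 3, size=20000)            # equal prior probability 1/K
z = simulate_misclassified(x, P_example, rng)
P_draw, counts = draw_transition_matrix(x, z, np.ones((3, 3)), rng)
```

With a flat Dirichlet prior (all parameters equal to one, an assumption made here) and many observations, the draw concentrates around the matrix used to generate the data.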


3 Results and conclusions 


We fit a unit level small area model with measurement error in covariates, which
also allows us to evaluate the relationship between economic inequality, income
and generosity. We use data from the Measuring Morality study, a nationally representative
survey of United States residents consisting of a sample of 1498 respondents.
For each respondent, income and some personal and demographic
variables (such as age, gender, education, ...) have been collected. Respondents
completed a validated behavioural measure of generosity: the dictator game. Respondents
learned that they had been randomly assigned the role of decider and
had received 10 tickets, each worth one entry in a raffle to win a monetary prize
of either 10 or 500 US dollars. They could transfer any number of tickets to the next
participant, a receiver who did not have any tickets. By giving tickets, respondents
could benefit another person at a cost to themselves in a zero-sum opportunity to
win money. This measure of generosity was administered to individuals with different
incomes residing in areas (US states plus the District of Columbia) that vary
in levels of inequality, measured according to the Gini coefficient. The number
of respondents in each area (m = 9 divisions) ranges from 72 to 286.

In the proposed model we take generosity as the response variable and income, standardized
Gini coefficients and their interaction as auxiliary variables. According to the survey
design, household income was collected as a 19-class variable; for ease of
interpretation in the application we recoded it into five classes ($C_1$: less than 12500;
$C_2$: [12500, 30000]; $C_3$: (30000, 60000]; $C_4$: (60000, 125000]; $C_5$: over 125000).
Since income is self-reported and the Gini index is estimated using data from the
2012 American Community Survey, we can suspect that both auxiliary variables are
subject to measurement error. In order to evaluate the impact of accounting for this
source of error, we fit both the standard model that ignores the measurement error
and the model proposed in Section 2.

Figure 1 shows the posterior distribution of
the model parameters. The left panel reports the posterior distribution of the regression
parameters under the proposed measurement error model: income is the only
factor that significantly impacts on the response variable, since for all the other parameters
the 95% credible intervals contain the zero value ($CI_{Gini}$: [-0.207, 0.349],
$CI_{C_1 \times Gini}$: [-0.632, 0.241], $CI_{C_2 \times Gini}$: [-0.542, 0.217], $CI_{C_3 \times Gini}$: [-0.533, 0.189],
$CI_{C_4 \times Gini}$: [-0.827, -0.028]). With respect to income, it is apparent that generosity
increases with income, with the exception of the last class, in which the effect
on generosity is comparable to that of the second one. This actually means that
the richest are less generous than the others, which is in line with findings
in the mainstream literature on the subject. On the other hand, when one ignores the
measurement error, all the covariates and their interactions seem to be significant
(Figure 1, right panel). In particular, income exhibits a positive effect on generosity,
with no distinctions between income classes, which contradicts the economic theories;
moreover, an unexpectedly positive effect of inequality is found.

With respect to the measurement error for income, the posterior distribution of $p_{11}$ is
concentrated around 0.5 and almost uniformly distributed over the other categories. This is
empirical evidence that income is often underreported by the respondents. The
distributions of the other diagonal elements of $P$ are concentrated around 0.9 and
credible intervals do not contain 1. We conclude that measurement error has a significant
impact on income.

The small area estimates produced under the model with
and without measurement error are reported in Table 1. As can be seen, allowing
for measurement error in both continuous and categorical covariates also impacts
on estimation of the small area means in both point estimates (in particular for the
first division, which is one of the smallest ones) and measures of uncertainty. Also,
although the posterior means are not very different for the large areas, the ranking of
the divisions varies. In conclusion, our application reveals that ignoring the measurement error in
covariates may distort inferences and yield misleading conclusions.


Table 1 Small area estimates: posterior means of the small area means obtained with the model
that does not account for the measurement error (first row) and the model that accounts for it
(second row). Standard deviations in brackets.


Division                        1       2       3       4       5       6       7       8       9

Without measurement error     4.17    4.11    4.25    4.44    4.19    4.28    4.25    4.37    4.22
                             (0.27)  (0.33)  (0.18)  (0.20)  (0.24)  (0.10)  (0.14)  (0.16)  (0.23)

With measurement error        4.27    4.09    4.26    4.43    4.17    4.30    4.25    4.38    4.23
                             (0.36)  (0.41)  (0.38)  (0.37)  (0.40)  (0.33)  (0.34)  (0.32)  (0.40)
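
The change in the ranking of the divisions can be checked directly from the posterior means. The snippet below is a small illustration using the values as read from Table 1; the variable and function names are hypothetical, and ties are broken by division number, which is an arbitrary choice made here.

```python
import numpy as np

# Posterior means of the nine divisions, as read from Table 1.
no_me = np.array([4.17, 4.11, 4.25, 4.44, 4.19, 4.28, 4.25, 4.37, 4.22])
with_me = np.array([4.27, 4.09, 4.26, 4.43, 4.17, 4.30, 4.25, 4.38, 4.23])


def ranks_descending(values):
    """Rank 1 = largest posterior mean; ties are broken by division number."""
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


rank_no_me = ranks_descending(no_me)
rank_with_me = ranks_descending(with_me)
# Division numbers (1-based) whose rank differs between the two models.
changed = np.flatnonzero(rank_no_me != rank_with_me) + 1
```

Under this tie-breaking rule, division 1 moves from rank 8 to rank 4 once measurement error is accounted for, while the three divisions with the largest means keep their positions.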

Fig. 1 Posterior distribution of the model parameters. Left panel: posterior distributions obtained
from the proposed model. Right panel: posterior distributions from the model that ignores the 
measurement error. 


Structural changes in the employment
composition and wage inequality: 
A comparison across European countries 


Cambiamenti della struttura occupazionale e 
disuguaglianza salariale: Un confronto a livello europeo 


Gennaro Punzo and Mariateresa Ciommi 


Abstract For several years many countries have been experiencing a progressive
impoverishment of middle-skill jobs that has led to structural changes in their labour
markets (job polarisation, upgrading or downgrading of occupations). This paper
investigates how the shifts in the workforce affect wage inequality comparatively for
a selection of European countries. The RIF regression, tested on the EU-SILC data
(2005-2013), enables us to assess how much of the inequality differentials over time is
accounted for by workers’ endowments rather than by the capability of a country’s labour
market to capitalise on skills. An outright deterioration of all jobs, irrespective of the skill
levels required, and the lack of a well-defined structure of the labour market may
jeopardise the wage distribution, and the return effect plays a leading role in this process.


Abstract Molti paesi stanno assistendo ad un’alterazione nella composizione della
propria forza lavoro in seguito ad una progressiva diminuzione delle occupazioni 
con livelli intermedi di competenze. Alla luce di questi cambiamenti strutturali, si 
propone un’analisi comparativa della disuguaglianza salariale in Europa. La 
metodologia RIF, applicata a dati EU-SILC (2005-2013), permette di scorporare, 
dai differenziali di disuguaglianza, la quota imputabile alle dotazioni dei lavoratori 
da quella attribuibile alla capacità dei mercati di valorizzare tali risorse. L’assenza 
di una struttura di mercato ben definita, cui spesso si associa un deterioramento di 
tutte le occupazioni, può incidere seriamente sulla distribuzione salariale e, in tale 
processo, la componente “ritorno” riveste un ruolo fondamentale. 


Key words: Labour market, wage inequality, European countries, RIF regression 


1. Introduction


A basic tenet of the Kuznets theory holds that inequality tends to decline with
economic progress [13]. Hence, substantial changes in the global macroeconomic
environment create a general inequality climate for both developed and developing
countries (Galbraith and Kum, 2005). For instance, the US income distribution
suffered a hard shock during the Great Depression of the 1930s and the Second
World War (1939-1945), with permanent fallout in the years ahead. US income
inequality was still comparatively high in the 1970s and continued to grow until the
US reached the top of the rich-country inequality pyramid [12].

The ongoing global crisis, the worst since 1930, has produced painful effects
for most of Europe, especially for countries with weaker economies. As detailed by
Eurostat (on-line database), Eurozone unemployment increased from 7.5% to
11.3% between 2007 and 2013, while Mediterranean and Central/Eastern European
countries were affected by unemployment more severely [14]. It is because of these
emergencies that at least three of the goals of the Europe 2020 strategy for smart,
sustainable and inclusive growth relate directly to employment, productivity and
inequality. With the purpose of reaching an employment rate of 75% for 20-64-year-olds,
raising to at least 40% the share of 30-34-year-olds completing tertiary education
and lifting 20 million people out of poverty by 2020, the strategy focuses on the
target of “new skills for new jobs”, taking the headline idea of “more and better jobs”
from the earlier Lisbon agenda.

However, within the same country, workers with varying levels of skills suffered
to different extents and with different intensity. In particular, as discussed by
Eurofound [5], relatively recent trends show major declines in the demand for jobs in the
middle of the skills hierarchy. This has resulted in structural shifts in the composition of
the labour force that give rise to varying labour market outcomes and income inequality
trajectories [2]. In other words, changes in income inequality may be contextualised
in the structure of the country’s labour market in terms of job polarisation,
upgrading or downgrading of occupations. Specifically, job polarisation consists of a
relative expansion in the demand for jobs at the top and bottom of the skills
hierarchy and a shrinking of the jobs in the middle, while upgrading favours highly
qualified activities over low- and middle-skill jobs [1,10]. More rarely,
low-skilled jobs grow faster than the rest, leading to a downgrading of occupations
[11].

Against this background, the paper aims at identifying regularities in the structural shifts in
the labour market comparatively for ten European countries and their potential
relationships with the changes in the wage distribution. Borrowing the geographical
classification by Nolan et al. [14], which approximately corresponds to the standard
welfare regimes typology [4], the following countries were selected:

1) The “Big Three” of Europe: France, Germany, and the United Kingdom 

2) The four Mediterranean countries: Italy, Greece, Portugal, and Spain 

3) Three Central/Eastern countries: Czech Republic, Hungary, and Poland. 

The Recentered Influence Function (RIF) regression [7,8] allows: i) exploring the
primary driving forces of wage inequality, ii) decomposing inequality gaps into the
effect attributable to individuals’ characteristics (endowment) and the effect due to
the capability of a country’s labour market to valorise workers’ characteristics (return),
iii) splitting these components to evaluate the contribution that each factor makes to
inequality change.
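
The mechanics of a RIF-based decomposition can be sketched as follows. The recentered influence function of a distributional statistic is the statistic plus its influence function; for the $\tau$-th quantile $q_\tau$ with density $f$ it is $q_\tau + (\tau - 1\{y \le q_\tau\}) / f(q_\tau)$. The code below is an illustration under stated assumptions, not the authors' specification: it uses a quantile rather than the inequality measure of the paper (which is not stated in this section), a Gaussian kernel density estimate with a rule-of-thumb bandwidth, and one common ordering of the endowment and return terms. All function names are hypothetical.

```python
import numpy as np


def rif_quantile(y, tau):
    """Recentered influence function of the tau-th quantile, evaluated at each y."""
    q = np.quantile(y, tau)
    n = len(y)
    h = 1.06 * np.std(y, ddof=1) * n ** (-0.2)   # rule-of-thumb bandwidth
    # Gaussian kernel density estimate at the quantile.
    f_q = np.mean(np.exp(-0.5 * ((y - q) / h) ** 2)) / (h * np.sqrt(2 * np.pi))
    return q + (tau - (y <= q)) / f_q


def ols_with_means(X, y):
    """OLS coefficients (intercept first) and covariate means (with a leading 1)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta, X1.mean(axis=0)


def rif_decomposition(X0, y0, X1, y1, tau):
    """Split the change in a quantile between period 0 and period 1."""
    b0, m0 = ols_with_means(X0, rif_quantile(y0, tau))
    b1, m1 = ols_with_means(X1, rif_quantile(y1, tau))
    endowment = (m1 - m0) @ b0        # change in characteristics at period-0 returns
    returns = m1 @ (b1 - b0)          # change in returns at period-1 characteristics
    total = m1 @ b1 - m0 @ b0         # equals the change in the mean RIF
    return endowment, returns, total
```

Because the regressions include an intercept, the fitted values average to the mean of the RIF, so the endowment and return terms add up exactly to the total change. Further splitting each term covariate by covariate gives the detailed contributions mentioned in point iii).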


2. Changes in the countries’ labour markets 


This Section describes how the structure of employment changed between 2005 and
2013 in the selected European countries and how these shifts produced varying
patterns in their labour markets. The choice of 2005 and 2013 as the reference years
allows us to obtain clues about the socio-economic scenarios that foreshadowed the
global crisis and their role in affecting the structure of the countries’ labour markets
and patterns of wage inequality.

The data are from the EU-SILC (European Union Statistics on Income and Living
Conditions), which is currently the main European reference source for comparable
socio-economic statistics at both the household and individual levels. Starting from
the assumption that inequality starts in the labour market, changes in the wage
distribution become the key factors behind inequality trends. Therefore, our analysis
focuses on employees, aged 16-64, irrespective of their activity sector, excluding
those employed in military occupations. They are classified into three distinct
groups of high-, middle- and low-skilled employees based on the level of expertise
required to perform their specific job. Given the strong correlation between the
current average education level and the skills required to perform a job (Eurostat,
2010), the average level of education is selected as a measure of the skills needed.

Figure 1 shows the percentage changes in the composition of employment shares 
between 2005 and 2013 for each of the three broad categories of employees by skills 
level in the selected countries. The results allow the countries to be classified 
according to the patterns of the labour market sketched over time in terms of job 
polarisation, the upgrading of occupations or neither of the two. 

Figure 1: Percent changes 2005-2013 in employment shares by skill levels by country

The French, British and Portuguese labour markets are mostly characterised by
the upgrading of occupations. They share a growth in professions that demand high
skills and the simultaneous contraction of low- and middle-skill activities. In France,
the above-mentioned jobs have both decreased by 32%, whereas the share of high-skill
workers has increased by 27%. The United Kingdom follows a similar trend,
albeit with less intensity. Against a reduction in low- and middle-skill jobs of
about one-quarter, Portugal shows the largest proliferation of high-skill activities.

In Poland, instead, the drastic reduction in the demand for low- (-71%) and 
middle-skill (-48%) jobs is matched only by slow growth in highly specialised jobs 
(+2%). The Czech Republic deserves a specific mention: all jobs have grown 
simultaneously, regardless of the level of skills required, and, surprisingly, the 
demand for high-skill jobs has practically doubled (+98%). In brief, the structural 
changes in the Polish and Czech labour markets provide evidence of two patterns that 
may evolve further in the future but, at present, can be described as relatively 
upgraded. 

In Germany, middle-skill jobs have declined as a share of employment by about 
40%, with slightly increasing levels of high-skill occupations. In Hungary, by 
contrast, the small decrease in middle-skill jobs goes together with a sizeable 
expansion of jobs at the high (+47%) and low (+23%) ends of the skill spectrum. 
Similarly, Greece has seen a large increase in the share of its low- and 
high-skilled employees (+40% and +57%, respectively) and a shrinkage of middle-skill 
jobs by 6%. Accordingly, while the Hungarian and Greek labour markets may be 
classified as purely polarised, the German one is only relatively polarised. 

Finally, for Italy and Spain, the changes in 2005-2013 do not enable us to 
determine whether one phenomenon prevails over the other. More precisely, it is not 
possible to identify which structure prevails because the share of employees has 
decreased for each of the three groups, in contrast to both job polarisation, where 
only the middle-skill jobs fall, and the upgrading of occupations, which involves a 
decrease in the share of low- and middle-skill jobs with a simultaneous growth in 
high-skill activities. In both countries, the strong deterioration in the employment 
structure, even more severe for Italy, sketches hybrid labour market patterns. 
However, both Italian and Spanish high-skilled employees suffer relatively smaller 
declines than their low- and middle-skill counterparts. 
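
The classification used in this section can be sketched as a simple sign rule on the three percentage changes. This is only an illustrative reading of the definitions given above; the function name and the labels "hybrid" and "other" are assumptions introduced here, and the rule does not capture the finer "relatively upgraded" or "relatively polarised" distinctions drawn in the text.

```python
def classify_pattern(low, middle, high):
    """Classify a labour market from percent changes in employment shares.

    Hypothetical helper: a sign-based reading of the definitions in the text,
    not the authors' procedure.
    """
    if low < 0 and middle < 0 and high < 0:
        return "hybrid"        # all three groups shrink (Italy and Spain in the text)
    if low < 0 and middle < 0 and high > 0:
        return "upgrading"     # low- and middle-skill fall, high-skill grows
    if low > 0 and middle < 0 and high > 0:
        return "polarisation"  # only the middle-skill jobs fall
    return "other"             # e.g. simultaneous growth of all groups


# Percent changes (low, middle, high) as reported in the text.
changes = {
    "France": (-32, -32, 27),
    "Poland": (-71, -48, 2),
    "Greece": (40, -6, 57),
}

if __name__ == "__main__":
    for country, (low, middle, high) in changes.items():
        print(country, classify_pattern(low, middle, high))
```

A pure sign rule places Poland under upgrading even though its high-skill growth is only +2%, which is why the text qualifies that case as "relatively" upgraded.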


3. RIF decomposition 


The Recentered Influence Function (RIF) regression [7,8] of Gini on the (log of) 
gross individual wage replaces the log-wage as the dependent variable with the 
recentered influence function of the Gini coefficient v(F) and directly estimates 
the impact of the explanatory variables on Gini. First, the RIF methodology allows 
the exploration of the primary driving forces of the inequality-generating process 
by country. Second, the overall Gini change between 2005 and 2013 is decomposed by 
country into the endowment and return effects. Third, the latter two components are 
computed for each covariate to identify the factors that contribute most, in 
quantitative terms, to inequality differentials over time. 

The RIF approach overcomes the two main limitations of the Oaxaca-Blinder 
method: i) the estimates of the return and endowment effects can be misleading if 
the linear model is misspecified; ii) the contribution of each covariate to the 
return effect is highly sensitive to the choice of the base group. Moreover, the 
Oaxaca-Blinder method enables the decomposition to be applied only to the mean, 
while the RIF approach also allows the decomposition of Gini (or the median, 
quantiles and variance). The Juhn, Murphy and Pierce method and the quantile-based 
decomposition by Machado and Mata overcome these drawbacks, but they are unable to 
trace the contribution of each covariate to the endowment effect whenever they are 
used to compute the decomposition for various distributional statistics [7]. 

The observed wage $Y_i$ can be written, without imposing a specific functional 
form, as a wage determination function of observed components $X_i$ and some 
unobserved components $\varepsilon_i$: 

$$Y_{gi} = f_g(X_i, \varepsilon_i), \quad g = 0, 1 \tag{1}$$

where $g = 1$ for workers observed in group 1 and $g = 0$ for those in group 0. In 
this work, the two groups are composed of employees in 2005 and 2013. 

Let $v(F_Y)$ be the generic distributional statistic under study (in this work, 
Gini). Its first-order directional derivative is known as the influence function 
$\mathrm{IF}(y; v)$, which measures the relative effect of a small perturbation in 
the underlying outcome distribution on the statistic of interest. The recentered 
influence function (RIF) is: 

$$\mathrm{RIF}(y; v) = \mathrm{IF}(y; v) + v(F_Y) \tag{2}$$
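
As a point of reference, and not part of the original derivation, the mean $\mu$ is the simplest special case of the recentering step in (2). Its influence function is the deviation from the mean, so the RIF returns the observation itself and a RIF regression collapses to an ordinary regression of the outcome on the covariates.

```latex
\mathrm{IF}(y;\mu) = y - \mu
\quad\Longrightarrow\quad
\mathrm{RIF}(y;\mu) = (y - \mu) + \mu = y
```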


The conditional expectation of $\mathrm{RIF}(Y; v)$ can be modelled as a linear 
function of the covariates: 

$$E[\mathrm{RIF}(Y; v) \mid X] = X\gamma + \varepsilon \tag{3}$$

where the parameters $\gamma$, which are the marginal effects of $X$ on $v$, can be 
estimated by OLS. 
As regards the Gini coefficient, the distributional statistic $v$ is defined as: 

$$v^{GC}(F_Y) = 1 - 2\mu^{-1} R(F_Y) \tag{4}$$

where $\mu$ is the mean of $Y$, $R(F_Y) = \int_0^1 GL(p; F_Y)\,dp$ with 
$p(y) = F_Y(y)$, and the Generalised Lorenz ordinate of $F_Y$ is given by 
$GL(p; F_Y) = \int_{-\infty}^{F_Y^{-1}(p)} z\,dF_Y(z)$. As demonstrated by Firpo et 
al. (2007), the recentered influence function of Gini can be rewritten as: 

$$\mathrm{RIF}(y; v^{GC}) = 1 + 2\mu^{-2} R(F_Y)\,y - 2\mu^{-1}\left[y\,(1 - p(y)) + GL(p(y); F_Y)\right] \tag{5}$$
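
The recentered influence function of Gini given above can be evaluated on a sample by plugging in empirical counterparts of the mean, the distribution function and the Generalised Lorenz ordinate. The sketch below is a minimal illustration under stated assumptions: the function name `gini_rif`, the rank-based discretisation and the simulated log-normal wages are introduced here and are not taken from the paper, which works with EU-SILC microdata and sampling weights.

```python
import random


def gini_rif(wages):
    """Return (gini, rif) for a list of positive wages.

    Empirical plug-in of
        RIF(y) = 1 + 2*R*y/mu**2 - (2/mu) * (y*(1 - p(y)) + GL(p(y)))
    with p(y) the empirical CDF and GL the generalised Lorenz ordinate.
    Ties are ranked in arbitrary order, which is adequate for a sketch.
    """
    n = len(wages)
    mu = sum(wages) / n
    order = sorted(range(n), key=lambda i: wages[i])

    p = [0.0] * n    # empirical CDF at each observation
    gl = [0.0] * n   # generalised Lorenz ordinate at each observation
    running = 0.0
    for rank, i in enumerate(order, start=1):
        running += wages[i]
        p[i] = rank / n
        gl[i] = running / n

    r = sum(gl) / n  # R(F): discretised integral of GL over p
    gini = 1.0 - 2.0 * r / mu
    rif = [
        1.0 + 2.0 * r * y / mu ** 2 - (2.0 / mu) * (y * (1.0 - p[i]) + gl[i])
        for i, y in enumerate(wages)
    ]
    return gini, rif


if __name__ == "__main__":
    rng = random.Random(0)  # simulated log-normal wages, purely illustrative
    sample = [rng.lognormvariate(0.0, 0.5) for _ in range(2000)]
    gini, rif = gini_rif(sample)
    print(round(gini, 3), round(sum(rif) / len(rif), 3))
```

The sample average of the RIF values reproduces the Gini coefficient up to a discretisation term of order 1/n. This is the property that lets a regression of the RIF on covariates be read as the effect of those covariates on Gini.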


The Gini ($v^{GC}$) gap between the periods 0 and 1 is decomposed as: 

$$\hat{\Delta}_O^{v^{GC}} = \bar{X}_1\left(\hat{\gamma}_{1,v^{GC}} - \hat{\gamma}_{0,v^{GC}}\right) + \left(\bar{X}_1 - \bar{X}_0\right)\hat{\gamma}_{0,v^{GC}} = \hat{\Delta}_S^{v^{GC}} + \hat{\Delta}_X^{v^{GC}} \tag{6}$$


The overall inequality gap ($\hat{\Delta}_O^{v^{GC}}$) is disentangled into the 
return effect ($\hat{\Delta}_S^{v^{GC}}$) and the endowment effect 
($\hat{\Delta}_X^{v^{GC}}$). The first term corresponds to the effect on $v$ of a 
change from $f_1(\cdot,\cdot)$ to $f_0(\cdot,\cdot)$ while keeping the distribution 
of $(X, \varepsilon) \mid G = 1$ constant. Conversely, the endowment effect keeps 
the return structure $f_0(\cdot,\cdot)$ constant and measures the effect of changes 
from $(X, \varepsilon) \mid G = 1$ to $(X, \varepsilon) \mid G = 0$. Further 
methodological details on the contribution of a single covariate in the decomposition 
can be found in Firpo et al. [7] and Fortin et al. [8]. However, the key term for 
decomposing $v^{GC}$ is the counterfactual distributional statistic $v_C^{GC}$, 
which is the distributional statistic that would have prevailed if the workers 
observed in group 1 had faced the returns of period 0. Using the counterfactual 
distribution, the above-mentioned components can be rewritten as: 


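$$\hat{\Delta}_S^{v^{GC}} = \bar{X}_1\left(\hat{\gamma}_{1,v^{GC}} - \hat{\gamma}_{C,v^{GC}}\right) \quad \text{and} \quad \hat{\Delta}_X^{v^{GC}} = \left(\bar{X}_C - \bar{X}_0\right)\hat{\gamma}_{0,v^{GC}} \tag{7}$$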


The estimation of the coefficients $\gamma_1$, $\gamma_0$ and $\gamma_C$ requires 
first estimating the weighting functions $\omega_1(G)$, $\omega_0(G)$ and 
$\omega_C(G, X)$. To obtain weights that sum to one, the normalisation procedure is 
used (details in DiNardo et al. [3] and Firpo et al. [7]). 
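
The logic of the decomposition in (6) can be illustrated with a small sketch that regresses RIF values on an intercept and a single covariate in each period and then splits the gap in mean RIF into a return (wage structure) part and an endowment (composition) part. It is a simplified illustration only: the helper names, the single covariate and the toy numbers are assumptions made here, and the reweighting step behind the counterfactual coefficients in (7) is not implemented.

```python
def ols_simple(x, y):
    """OLS of y on an intercept and one regressor x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def decompose(x0, rif0, x1, rif1):
    """Split the gap in mean RIF between period 0 and period 1.

    Returns (total, structure, composition), where
      structure   = Xbar_1 * (gamma_1 - gamma_0), intercept included
      composition = (Xbar_1 - Xbar_0) * gamma_0
    """
    a0, b0 = ols_simple(x0, rif0)
    a1, b1 = ols_simple(x1, rif1)
    xbar0 = sum(x0) / len(x0)
    xbar1 = sum(x1) / len(x1)
    structure = (a1 - a0) + xbar1 * (b1 - b0)
    composition = (xbar1 - xbar0) * b0
    total = sum(rif1) / len(rif1) - sum(rif0) / len(rif0)
    return total, structure, composition


if __name__ == "__main__":
    # Made-up covariate values (e.g. years of education) and RIF values.
    x0, rif0 = [8, 10, 12, 14, 16], [0.36, 0.33, 0.31, 0.27, 0.26]
    x1, rif1 = [9, 11, 13, 15, 17], [0.35, 0.30, 0.29, 0.24, 0.22]
    total, structure, composition = decompose(x0, rif0, x1, rif1)
    print(total, structure, composition)
```

Because each OLS fit with an intercept passes through the sample means, the two parts add up to the total gap by construction, which is the accounting identity behind (6).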


4. Main results 


Once the RIF regressions of Gini on log-wage are estimated to explore primary 
factors (individual, human capital, and job-related) that drive the observed wage 
inequality by country, the overall Gini differences in 2005-2013 are disentangled 
into the endowment (composition effect) and return effects (wage structure) (Tables 
1-3). The composition effect assesses the portion of Gini change attributable to the 
employees’ endowments. The wage structure explores the capability of the country’s 
labour market to transform individual skills into job opportunities and earnings and 
explains why employees with the same individual characteristics are rewarded 
differently. Standard errors of components are computed according to the method 
detailed in Fortin et al. [8]. 

The overall Gini has declined over time in France and Germany, two of the most 
developed European economies, although the fall has been more pronounced in the 
transition countries of Central/Eastern Europe. Conversely, the United Kingdom and 
Italy (which together with Germany and France form the “Big Four” of Europe) show 
a rise in overall inequality, which is however far larger for Italy. In line with 
the literature [14], Italy is an unequal country compared with others with similar 
patterns of growth, but relatively less unequal than the other Mediterranean 
countries. In fact, Greece still has higher levels of inequality, although its Gini 
index increased less than Italy's in 2005-2013. Wage inequality has also increased 
markedly in Spain; it has remained constant in Portugal, consistent with the 
literature [14], which shows a reversal of the previous increase in income 
inequality since 2005, although the decrease has not been large enough to 
compensate for the strong growth of inequality during 1989-1994. 


Table 1 RIF decomposition of Gini on log-wage. Gap 2005-2013. Western countries 

                  France                Germany               The UK
Total gap         -0.0041***     -      -0.0043***     -       0.0011*       -
Composition       -0.0021**    51.2%    -0.0039**    90.7%    -0.0004     -36.4%
Wage structure    -0.0020**    48.8%    -0.0004       9.3%     0.0015*    136.4%

*Significant at 10%; **Significant at 5%; ***Significant at 1%. 


Table 2 RIF decomposition of Gini on log-wage. Gap 2005-2013. Mediterranean countries 

                  Italy                 Greece                Portugal            Spain
Total gap         0.0064**       -      0.0037*        -      -0.0001      -       0.0101*        -
Composition       0.0004       6.3%     0.0019*      52.6%     0.0041      -      -0.0021***   -20.8%
Wage structure    0.0060***   93.7%     0.0017***    47.4%    -0.0042      -       0.0122**    120.8%

*Significant at 10%; **Significant at 5%; ***Significant at 1%. 


Table 3 RIF decomposition of Gini on log-wage. Gap 2005-2013. Central/Eastern countries 

                  Czech Republic        Hungary               Poland
Total gap         -0.0060***     -      -0.0119**      -      -0.0176***     -
Composition       -0.0043***  71.47%    -0.0077***  64.71%    -0.0040**   22.73%
Wage structure    -0.0017**   28.73%    -0.0042***  35.29%    -0.0136**   77.27%

*Significant at 10%; **Significant at 5%; ***Significant at 1%. 
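
The percentage columns in the tables appear to express each component as a share of the total gap. The short sketch below reproduces the Table 1 shares under that assumption; the function name `shares` is introduced here for illustration.

```python
def shares(total_gap, composition, wage_structure):
    """Return each component as a percentage of the total gap, to one decimal."""
    return (
        round(100 * composition / total_gap, 1),
        round(100 * wage_structure / total_gap, 1),
    )


# (total gap, composition, wage structure) as reported in Table 1.
table1 = {
    "France": (-0.0041, -0.0021, -0.0020),
    "Germany": (-0.0043, -0.0039, -0.0004),
    "The UK": (0.0011, -0.0004, 0.0015),
}

if __name__ == "__main__":
    for country, values in table1.items():
        print(country, shares(*values))
```

A share above 100% paired with a negative one, as for the United Kingdom, means that the two components pull in opposite directions: the wage structure raises inequality by more than the total gap, while the composition effect partly offsets it.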


In countries that have experienced a decline in wage inequality, a great deal of 
the total change is due to the composition effect. More than 90% of the reduction 
in wage inequality in Germany (where the wage structure is not even significant), 
about three-quarters in the Czech Republic and two-thirds in Hungary depend on 
changes in workers' characteristics over time. In other words, in these countries, 
employees' endowments in terms of characteristics and potential have contributed 
more effectively to decreasing, or at least not increasing, wage inequality. 
Instead, the wage structure plays a leading role (Spain), if not an exclusive one 
(the United Kingdom, Italy), in increasing wage inequality, stressing the low 
capacity of these countries' labour markets to transform inputs into better 
job-related careers and higher earnings. Not only the skill endowments but also the 
ways in which they are rewarded in the labour market are crucial in explaining 
differentials in wage inequality over time. A more detailed analysis (whose results 
are not reported for brevity) has identified human capital endowments and 
job-related characteristics as the individual resources that contribute most to 
shaping, in one direction or another, wage inequality differentials within the two 
components of composition effect and wage structure. 

In sum, the countries that experienced a decrease (or at least no increase) in 
wage inequality (France, Portugal, Poland, the Czech Republic, Hungary and Germany) 
share shifts in the employment composition between 2005 and 2013 that have led to 
more explicit and clearly defined labour market structures (upgrading or relative 
upgrading, polarisation or relative polarisation). Probably, the employment changes, 
which have led the labour markets towards more upgraded or polarised structures, 
usually less unequal, halted the growth of inequality within the country, with an 
equalising effect on the wage distribution. In Greece, the employment changes 
towards a more polarised pattern have only slowed the growth of inequality within 
the country, mainly because the recent crisis hit Greece even harder. Conversely, 
in Italy and Spain, where the distribution of occupations by skill level appears to 
be more ambiguous, the increasing differentials in wage inequality are mostly 
attributable to the lower efficiency of their labour markets in offering better job 
opportunities and careers, and thus better salaries, for employees. In other words, 
the outright deterioration of all jobs, irrespective of the skill level required, 
and the lack of a clear structure in the Italian and Spanish labour markets have 
exacerbated disparities among the three sub-groups of employees, increasing overall 
inequality within these countries. 


Official Statistics 4.0 - learning from history for 
the challenges of the future 


Statistica Ufficiale 4.0 - apprendere dalla storia per le 
sfide del futuro 


Walter J. Radermacher, La Sapienza University Rome, wjr@outlook.de 


Abstract 

The quantity of digital data created, stored and processed in the world has grown 
exponentially. The demand for statistics and the power of facts has never been so 
apparent. In the process of adapting to the new reality, Official Statistics will continue 
to focus on relevant products, efficient production processes and quality of statistical 
information. Quality, trust and authority are central to Official Statistics in a modern 
democratic society, holding a neutral and impartial position between the political 
decision-makers and citizens. The article underlines the importance of statistical 
governance and quality management. 


Abstract La quantità di dati digitali creati, memorizzati e elaborati nel mondo è 
cresciuta in modo esponenziale. La domanda di statistica e la potenza dei fatti non è 
mai stata così notevole. Nel processo di adattamento alla nuova realtà, le statistiche 
ufficiali continueranno a concentrarsi sui prodotti rilevanti, sui processi di 
produzione efficienti e sulla qualità delle informazioni statistiche. La qualità, la 
fiducia e l'autorità sono al centro della statistica ufficiale di una società democratica 
moderna, con una posizione neutra e imparziale tra i decisori politici ed i cittadini. 
L'articolo sottolinea l'importanza della governance statistica e della gestione della 
qualità. 


Key words: Official Statistics, Statistical Governance, Quality management, Data 
Revolution 


A (simplified) model of reality: statistics 


“There is no alternative to facts!” or “Science belongs to everyone” were slogans of 
the March for Science on Earth Day, 22 April 2017, a “call for science that upholds 
the common good and for political leaders and policy makers to enact evidence based 
policies in the public interest” (MarchforScience 2017). Hopefully, this initiative 
will be inspired by scientific reflections concerning the character, role and limits 
of science (Benessia et al. 2016), thus representing a ‘new enlightenment’ 
(EuropeanAlpbachForum 2016). 


"No substantial part of the universe is so simple that it can be grasped and controlled 
without abstraction. Abstraction consists in replacing the part of the universe under 
consideration by a model of similar but simpler structure. Models, formal or 
intellectual on the one hand, or material on the other, are thus a central necessity of 
scientific procedure. ... That is, in a specific example, the best material model for a 
cat is another, or preferably the same cat. In other words, should a material model 
thoroughly realize its purpose, the original situation could be grasped in its entirety 
and a model would be unnecessary."(Rosenblueth and Wiener 1945) 

In social sciences this relation between reality and a model of it has been introduced 
by Max Weber in form of what he called "Idealtypen". "According to Weber's 
definition, “an ideal type is formed by the one-sided accentuation of one or more 
points of view” according to which “concrete individual phenomena ... are arranged 
into a unified analytical construct” (Gedankenbild); in its purely fictional nature, it 
is a methodological “utopia [that] cannot be found empirically anywhere in reality” 
.... Keenly aware of its fictional nature, the ideal type never seeks to claim its validity 
in terms of a reproduction of or a correspondence with reality. Its validity can be 
ascertained only in terms of adequacy, which is too conveniently ignored by the 
proponents of positivism."(Kim 2012) 

It is of fundamental importance for Official Statistics that conceptual models are 
designed in such a way that they are 'adequate' abstractions of reality. This leads 
to the question of what ‘adequate’ concretely means, and which criteria and 
processes statistical methodology offers to answer it. Compared to other quality 
components of statistical information (e.g. sampling errors), this area is however 
less well covered by statistical theory. 

In German statistical terminology, ‘Adäquation’ (Grohmann 1985) represents the 
design phase within the process of statistical knowledge building, which basically 
consists of the choice of model parameters according to the purpose of the research, 
available resources, time constraints etc. “Data quality is depending on ... developing 
operational methods corresponding as much as possible to theory and - intensively 
controlling and monitoring the survey procedure in process.” (Radermacher 1992) 


That is to say, statistical information is produced with two main ingredients: 
methodology and conventions. On the one hand, “the notion of statistics as a primarily 
mathematical discipline really developed during the 20th century, perhaps up to 
around 1970, during which period the foundations of modern statistical inference 
were laid” (Hand 2009). On the other, the final products of statistical processes 
depend essentially on their conceptual design, which, as for other (manufactured) 
products, depends on whether the questions raised by stakeholders can be answered by 
statistics and whether they are answered in a satisfactory manner. 


The production of Official Statistics 


A simplified circular process chart describing the interaction between users and 
producers of information should help to understand the main features: 


Figure 1: Knowledge generation and statistical production process 
[Chart labels: users of statistics; information need; development (work program, 
priorities; products, services); statistical data + metadata; statistical 
information; communication] 


The key processes within the production sphere of this chart are 

• D: development and design, 
• P: production, 
• C: communication/dissemination, 

which correspond to widely accepted standards, such as the Generic Statistical 
Business Process Model (GSBPM) (UNECE 2013) or the Generic Statistical 
Information Model (GSIM) (UNECE 2017). In addition, it is essential to include 
explicitly the interaction with stakeholders and the following process on the user 
side: 

• U: creation of knowledge and application. 


The ultimate goal of statistical evidence is to contribute to better informed 
decisions of all kinds and for all types of users, which can only be achieved when 
all four processes are taken into account and integrated in a comprehensive 
conceptual approach. Each of them should contribute to excellent information 
quality. Each of them can of course also fail and contribute to errors, 
misunderstandings and underperformance: 

• The process D has an external part (dialogue with users) and an internal part 
  (development and testing of methods). Intensive cooperation with users is crucial 
  for the adequacy of the entire process chain that follows. 

• During the production process P the methods agreed in the preceding development 
  phase are implemented. It is relatively straightforward to measure the quality of 
  this process and its sub-processes against these predefined norms. 

• Communication processes C represent the other end of the user interaction. They 
  can also be grouped into an internal part (preparing the results from the 
  production process for different channels, access points etc.) and an external 
  part (interaction with users in all formats and through all channels). The 
  internal part also belongs to the set of predefined methods and is in that way 
  similar and closely linked to production. 

• The processes of application and use U are not under any kind of control or 
  influence by statisticians. It is however obvious that users might not be 
  sufficiently prepared or trained to interpret and use statistics in the best 
  possible manner. Statistical literacy is therefore an area of interest also for 
  statistical producers. Furthermore, statisticians should carefully observe cases 
  of wrong interpretation and they must protect their information against misuse. 


Evolution and continuous adaptation 


This process chart helps us to understand the planning of Official Statistics as an 
evolutionary process, as a sequence of learning cycles and feed-back loops: 


Fig. 2: Statistical learning process 
[Diagram with three corners: user needs; statistical methods; data sources, 
production processes] 


Over time, changes might be started from all three angles. New demands and political 
issues trigger new statistical developments, as new data sources or new methodologies 
do. Historically, it can be observed that these driving forces also influence each 
other, thus stimulating new episodes in official statistics (Desrosières 1998). 

It is therefore essential to link the communication process of today with the 
development of statistics for tomorrow. Partly, this loop could be a short one, if 
user feedback can lead to quick fixes and improvements in services. Partly, it might 
however take time, since changes in a programme need profound preparations and 
even more profound developments. 

This evolutionary development of statistics is confronted with several limiting 
factors, which could be practical limitations, such as: 

• Clandestine, non-observable phenomena 
• Statistical items in the future and elsewhere, relevant for decisions now and here 
  (e.g. capital goods, depreciation, trade chains, sustainable development) 
• Values and prices for non-market goods (Can we simulate non-existing markets?) 
• Limitations by resource or time constraints, in cases where only a limited amount 
  of information and data is available or where limited time is given for the 
  decision-making process. 

Limitations could also relate to the understanding and use of data and information: 

• Innumeracy, statistical and data illiteracy 
• High and too high expectations 
• (No) appetite for high quality information (Davies 2017) 


Since the beginning of Official Statistics in the nineteenth century, the boundaries 
have been substantially stretched. Continuous improvement has opened new 
opportunities so that today many subjects (e.g. quality of life), which were 
impossible to observe only a few years ago, are fully integrated elements of the 
standard statistical program. Nevertheless, it is crucial to understand that basic 
principles must be respected if the foundation on which trust in Official Statistics 
is built is not to be damaged. This is for example the reason to refrain from 
monetising natural resources and their services if they are not valued by market 
transactions. 


Consultation of users 


Official Statistics is a special application and form of statistics that belongs to 
the public infrastructure of (modern) states. Working methods in Official Statistics 
reflect both their political and administrative position and the status and 
development of societies (i.e. the specific relationship between state and 
citizens). 

In a modern and democratic definition, Official Statistics is no longer a knowledge 
tool in the hands of the powerful and mighty. Rather, it must follow principles of 
neutrality and impartiality, whereby this information infrastructure becomes an 
important democratic pillar, equally available and accessible for everyone 
(Radermacher and Bischoff 2017 forthcoming). 


Article 338 TFEU (EuropeanUnion 2012) 


1. Without prejudice to Article 5 of the Protocol on the Statute of the European System of 
Central Banks and of the European Central Bank, the European Parliament and the Council, 
acting in accordance with the ordinary legislative procedure, shall adopt measures for the 
production of statistics where necessary for the performance of the activities of the Union. 


2. The production of Union statistics shall conform to impartiality, reliability, objectivity, 
scientific independence, cost-effectiveness and statistical confidentiality; it shall not entail 
excessive burdens on economic operators. 


The interaction with stakeholders must be governed by principles of transparency, 
democratic control/supervision and a public/legislative form for all kinds of 
conventions. In particular, the programme of work must emerge from a democratic 
decision-making process, at the end of which a choice is made in favour of the 
'Pareto-optimal' composition of statistical tasks. Priority setting in this context 
has an important role to play, as it must facilitate the annual adaptation of the 
program following changes in user needs. 

The way in which this consultation and decision making has been organised so far 
relies mainly on the functioning of 'official' procedures concerning the preparation 
of legislation and political decisions. Modern societies however ask for more: wider 
consultation (more room for all active contributions from civil societies), new 
forms (collection of user needs through social media) and faster processes (quicker 
adaptation of the program). 


Official Statistics 3.0 — the last 25 years 


Since the beginning of the 1990s, the environment around statistics has changed 
dramatically due to several factors (Radermacher 2014b), such as: 

• Pressure on the public sector; major cuts in budgets and human resources 
• Reduced willingness to respond to statistical surveys; response burden as a 
  political target 
• Exponentially growing importance of ICT and new data sources (e.g. 
  administrative data, GIS) 
• New political demands (e.g. environment, globalization, migration) and crises 
  (e.g. financial) 


These changes are expressed in the Regulation of European Statistics: “Whereas 14: 
The operation of the ESS (the author: European Statistical System) also needs to be 
reviewed as more flexible development, production and dissemination methods of 
European statistics and clear priority-setting are required in order to reduce the 
burden on respondents and members of the ESS and improve the availability and 
timeliness of European statistics. ” (Eurostat 2015b) 


Changing the business model 


A widely-supported starting point concerning the strategic orientation for Official 
Statistics is: “Our output has traditionally been determined by the demands of our 
respective governments and other customers. The process is one of reasoning back 
from the output desired to survey design because often few or no pre-existing data 
were available. This paradigm has shaped the way official statistics are designed and 
produced. ... In the future it will become increasingly unrealistic to expect meaningful 
statistics from this approach, even when results are collected and transmitted 
electronically. ” (Vale 2017) 


Towards multiple source — mixed mode design 


Since the end of the 1990s, a re-engineering of the business model has been ongoing, 
according to which the single statistical production lines are bundled and integrated, 
common technical tools are developed and terminology is standardised, thus 
minimising redundancies, inefficiencies and sources of incoherence. Information is 
generated by (re-)using available data as far as possible, aiming at minimising 
response burden and costly surveys. 

In terms of the above-mentioned learning process this means that the current 
‘development loop’ is driven by changes on the production side, which will lead to 
substantial improvements. 

Nevertheless, one must take the implications for the other actors of the learning 
cycle into account. For example, under the traditional business model it was not 
difficult to organise functioning user-producer dialogues, since participants in 
these dialogues shared an interest in and knowledge of the same subject matter: 
agricultural statistics was discussed between the specialists for agricultural 
policies and the technicians in a specialised branch of the statistical office; the 
same applied to labour market, population, health statistics and so forth; a 
balanced agreement sufficient for static and narrow user needs. As long as 
statistical offices did not have to cope with substantial resource scarcities (and 
rapidly changing user needs), it was therefore not necessary to establish an overall 
program planning, to decide on priorities etc. The program was just the sum of a 
great number of partial solutions in each separate area; both users and producers 
were generally satisfied: users with their tailor-made products and producers with 
their control of the entire production process. This inefficient 
‘spaghetti-bowl-business-model’ of the past is being replaced by the new 
‘industrialised-process-model’: multiple-source inputs, standardised production, 
multiple-purpose outputs. The new business model of production cannot be 
‘administrated’ in a traditional manner. It needs to be ‘managed’, including the 
development of planning tools, a catalogue of products/services, marketing and cost 
accounting, which means nothing less than a complete overhaul of the traditional 
culture in Official Statistics. 


New components in the statistical programme 


Firstly, new products and services will be generated in this way. For example, if 
population censuses are moving towards a new design, it is possible to produce 
results not only every five or ten years but annually, which would better fit 
information needs in times of high population dynamics (e.g. migration, ageing) 
(Kyi and Knauth 2012). 

Secondly, the more integrative approach has stressed the fact that the different 
statistical bits and pieces should form a coherent information system, in which 
different types of products have their place. Within this system, basic statistics, 
macro-economic accounts and indicator sets have different functions and fulfil 
different roles. 

Finally, genuinely new ‘metrics’ are requested for new political debates, 
consequences of crises or other pressing demands from different stakeholders, which 
can be demonstrated with two examples: 

• Sustainable development: In 2009 the Commission on the Measurement of Economic 
  Performance and Social Progress issued a report which highlights the need to go 
  ‘beyond GDP’ (Stiglitz, Sen, and Fitoussi 2009); in 2015 the General Assembly of 
  the United Nations adopted the ‘Sustainable Development Goals’ (UnitedNations 
  2016). Both have already caused significant changes in statistical programs. 

• Globalisation: Recent reports from Sturgeon (Sturgeon 2013), Bean (Bean 2016) 
  and the ‘Economic Statistics Review Group’ (ESRG 2016) highlight the 
  shortcomings of economic statistics concerning the monitoring of globalised 
  production processes. New indicators have been proposed, which will have 
  consequences also for routines in basic statistics and ‘national’ accounts. 


Strengthening the statistical governance 


Complexity of the production process is significantly increased when the new 
business model is introduced. At the same time, statistical information has grown into 
a powerful tool in the political arena. New opportunities come with new risks 
(Radermacher 2017 forthcoming), which are taken into account by an adaptation of 
firstly the legal frames of Official Statistics, secondly the Codes of Conduct and 
thirdly the broadening and deepening of stakeholder consultation. 


European Statistics 


European political developments have asked for customised statistical solutions. The
introduction of a European single market created the need for another form of
external trade statistics (i.e. Intrastat), the Maastricht treaty asked for special statistical
monitoring (i.e. EDP statistics), and the European Central Bank requested solid and
comparable price statistics (i.e. HICP) (Radermacher 2012).

European statisticians were at the forefront of the international modernisation 
activities. With ‘Vision 404’ (Eurostat 2009a) a strategy for the next years was 
outlined, which contained all three dimensions: process, product and governance. 
With the communication ‘GDP and Beyond’ (Eurostat 2009b) the European 
Commission has set up a work program aiming at substantial improvements of 
statistical information. 

Meanwhile, these plans have been implemented: 

• The governance of the European Statistical System was substantially revised
(Eurostat 2015b; Radermacher 2014a);

• the multi-annual programme was adapted to the new business model (Eurostat 2013b);

• Sustainable Development indicators have been developed (Eurostat 2016b);

• the accounting layer has been modernised and broadened (Eurostat 2013c, 2016a);

• basic statistics have been re-engineered (e.g. demographic statistics, Census HUB,
HICP, integrated social and agricultural statistics, FRIBS).


Close cooperation amongst the partners in the European Statistical System was the 
enabling factor to forcefully implement the strategy that was outlined in the ESS 
Vision 2020 (Eurostat 2013a). 


In terms of user orientation important new initiatives were taken, such as: 

• Relaunch of the website, new visualisation tools, active use of social media,
DIGICOM (Eurostat 2017b);

• Public consultation in the preparation of new legislation (e.g. Eurostat 2015a);

• Indicators as user interface (Eurostat 2017a) and Conferences of European
Statistical Stakeholders (Smedt 2016).


Official Statistics 4.0 - data revolution 


The quantity of digital data created, stored and processed in the world has grown 
exponentially. Broad consensus reigns regarding the wonderful opportunities, which 
‘Big Data’ can bring in relation to the statistics acquired from traditional sources such 
as surveys and administrative records. Much faster and more frequent dissemination 
of data; responses of greater relevance to the specific requests of users since the gaps 
left by traditional statistical production are filled; refinement of existing measures, 
development of new indicators and the opening of new avenues for research; a 
substantial reduction in the burden on persons or businesses approached and a 
decrease in the non-response rate are all possibilities potentially offered. Access to 
Big Data could considerably reduce the costs of statistical production. 

However, the Big Data phenomenon also poses a certain number of challenges. These
data are not the result of a statistical production process designed in accordance with
standard practice. They do not fit the established methodologies, classifications and
definitions, and are therefore difficult to harmonise and convey in statistical structures.
Complex aggregates, such as GDP or the Consumer Price Index, aim at measuring
macro-economic indicators for the nation as a whole (Lehtonen 2015); their (immediate)
substitution by Big Data sources seems to be out of reach. In addition to this, major
legal issues are raised: security and confidentiality of data, respect for private life,
data ownership, etc. All the above means that, at least for now, Big Data can be used
only to a limited degree, to supplement rather than replace traditional data sources
in certain statistical fields. Their integration is, besides all technical aspects, a
challenge for an “informational governance” (Soma et al. 2016).


Conclusions and guiding principles (Radermacher and Baldacci 2016) 


Statistics is a key for people empowerment 

High-quality statistics strengthen democracy by allowing citizens access to key
information that enhances accountability. Access to solid statistics is a fundamental
"right" that permits choices and decisions based on information. Without statistics
there cannot be a well-grounded and participatory democracy.

Statisticians should be aware of the power of data, which lies in their transformation
into information services for knowledge.

Open data are fundamental for open societies 

Statistics are the cornerstone of public open data. They are the basis of open
government. In the EU Open Data Portal, the Eurostat statistical database accounts
for the bulk of the data offered. Enhancing access to statistics in open formats enables
the free use of data, its interoperability and its consumption in integrated modalities.
Open statistics, as a result, make it possible to make sense of complex phenomena and
help in their interpretation without borders and limits. As such, open statistics are a
key source of free dialogue in our societies.

Statisticians should ensure open and transparent access to data and metadata and 
measure their actual use for information and knowledge. 

Datacy is a key enabler for citizens 

Statistical literacy is critical to ensure that individuals can benefit from the power of
data and can make use of open access to statistical information and its associated
services. Data literacy is not limited to knowledge of basic statistical information; it
entails knowing the limits of statistics and their use and misuse. Capabilities to
understand statistics and how they are produced are fundamental skills for a
well-rounded individual and an aware citizen.

Statisticians should proactively invest in datacy capabilities in society at large and 
measure the results of statistical literacy. 

The future is smart statistics 

The value of data is in the statistical methods which ensure quality services. In the 
digital ecosystem where data are abundant and a commodity, the value of information 
is increasingly based on algorithms that generate tailored insights for users. 
Statisticians should continue to invest in methods and algorithms that enhance the 
quality of data for statistical services tailored to users’ needs. 

More influence means more responsibilities 

As statistical information is increasingly used for policy decisions, statisticians need 
to investigate how their services are used, the ethical implications and the impact of 
evidence use on the policy cycle. 

It is a duty of statisticians to explore the link between statistics, science and society 
and lead intellectual reflections on the possible risk of reliance on data-centrism. 


Comparison of contingency tables under 
quasi-symmetry 


Confronto di tabelle di contingenza sotto l’ipotesi di 
quasi-simmetria 


Fabio Rapallo 


Abstract In this work we define a test to compare several square contingency tables
under the quasi-symmetry model. Working within the class of log-linear models,
we present a suitable model and an exact test to verify whether two or more tables fit a
common quasi-symmetry model. The exact test is then defined through classical
tools of Algebraic Statistics, namely the computation of a Markov basis and the
application of an MCMC algorithm.

Abstract In this work we define a test to compare square contingency tables under
the quasi-symmetry hypothesis. Remaining within the class of log-linear models,
we define an appropriate model and an exact test to verify whether two or more
tables can satisfy a common quasi-symmetry model. The exact test is defined through
the classical tools of Algebraic Statistics, namely the computation of a Markov basis
and the application of an MCMC-type algorithm.


Key words: Algebraic Statistics, exact tests, Markov bases, MCMC algorithms 


1 Introduction 


Complex models for contingency tables have received increasing interest in the
last decades from researchers and practitioners in different fields, from Biology to
Medicine, from Economics to Social Science. As general references for the statistical
models for contingency tables see [1] and [7]. Quasi-symmetry and quasi-independence
models are well-known log-linear models for square contingency tables. Under these
models, it is possible to fix the diagonal counts, or even to analyze incomplete tables
where the diagonal counts are undefined or unavailable. In the next section we will
recall the basic facts on the quasi-symmetry model, while for a full presentation and
a historical overview the reader can refer to [3] and [6].

Since the quasi-symmetry model is a log-linear model, one can test the goodness
of fit through the classical chi-squared approximation of the Pearson or the
likelihood ratio test statistics. Alternatively, exact tests have been introduced for
quasi-symmetry and for other classes of log-linear models. Apart from the independence
model, exact tests are usually difficult to implement, and the new techniques introduced
with Algebraic Statistics have allowed noticeable progress through the notion
of Markov basis and the definition of the Diaconis-Sturmfels (D-S) algorithm. Notice
that the exact approach is particularly important for quasi-symmetry models,
because the asymptotic approximation can fail even with moderately large sample sizes,
as noted in [9].

Algebraic Statistics has been a rapidly growing research area, with major applications
to the analysis of contingency tables. In addition to a general algorithm
for exact inference, Algebraic Statistics provides an easy description of complex
log-linear models for multi-way tables and it represents the natural environment to
define statistical models for contingency tables with structural zeros, through the
notion of toric models. Toric models are a generalization of log-linear models that
also allows zero-probability cells. As general references on the use of Algebraic
Statistics for contingency tables, see [5] and [2]. Some specific statistical models
related to quasi-symmetry in the framework of Algebraic Statistics can be found in [10].

In this paper, we use classical techniques from Algebraic Statistics in order to 
compare several contingency tables under the quasi-symmetry model. In particular, 
we present an exact test to verify if two or more tables fit a common quasi-symmetry 
model, versus the alternative hypothesis that each table follows a specific quasi- 
symmetry model with its own parameters. This is accomplished by the construction 
of a suitable three-way table and the definition of new log-linear models for this new 
table. The exact test is then derived by applying the D-S algorithm. 

The material is organized as follows. In Sect. 2 we recall some definitions and
basic results about log-linear models and toric models, with special attention to
quasi-symmetry. In Sect. 3 we show how to define suitable log-linear models
to compare two or more square tables under quasi-symmetry, together with the
description of the D-S algorithm for this application. Finally, Sect. 4 is devoted to the
illustration of a real-data example and some pointers to future work.


2 Log-linear models and quasi symmetry 


A probability distribution on a finite sample space $\mathcal{X}$ with $K$ elements is a
normalized vector of $K$ non-negative real numbers. Thus, the most general probability
model is the simplex

$$\Delta = \Big\{ (p_1,\ldots,p_K) \,:\, p_k \ge 0, \ \sum_{k=1}^{K} p_k = 1 \Big\}.$$

A statistical model $\mathcal{M}$ is therefore a subset of $\Delta$.

A classical example of finite sample space is the case of a multi-way contingency
table where the cells are the joint counts of two or more random variables with
a finite number of levels each. In the case of square two-way contingency tables,
where the sample space is usually written as a cartesian product of the form
$\mathcal{X} = \{1,\ldots,I\} \times \{1,\ldots,I\}$, we will use the notation $p_{i,j}$ to ease the readability.

Following the classical theory of log-linear models, under the Poisson sampling
scheme the cell counts are independent Poisson random variables with means
$Np_1,\ldots,Np_K$, where $N$ is the sample size, and the statistical
model specifies constraints on the raw parameters $p_1,\ldots,p_K$. A model is log-linear
if the log-probabilities lie in an affine subspace of the vector space $\mathbb{R}^K$. Given $d$ real
parameters $\theta_1,\ldots,\theta_d$, a log-linear model is described, apart from normalization,
through the equations:

$$\log(p_k) = \sum_{r=1}^{d} A_{k,r}\,\theta_r \qquad (1)$$

for $k = 1,\ldots,K$, where $A$ is the model matrix (or design matrix), see [8].
Exponentiating Eq. (1), we obtain the expression of the corresponding toric model

$$p_k = \prod_{r=1}^{d} \zeta_r^{A_{k,r}} \qquad (2)$$

for $k = 1,\ldots,K$, where $\zeta_r = \exp(\theta_r)$, $r = 1,\ldots,d$, are the new non-negative
parameters. It follows that the model matrix $A$ is also the matrix representation of the
minimal sufficient statistic of the model. The matrix representation of the toric
models as in Eq. (2) is widely discussed in, e.g., [11] and [5]. Note that from Eq. (1) it
follows that different model matrices with the same image as vector space generate
the same log-linear model.
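
The equivalence between Eq. (1) and Eq. (2) can be checked numerically. The sketch below is only an illustration: the model matrix `A` (the independence model for a 2x2 table), the parameter values and the function names are assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

# Hypothetical model matrix A (K = 4 cells, d = 3 parameters): independence
# model for a 2x2 table, cells ordered (1,1), (1,2), (2,1), (2,2).
A = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
])
theta = np.array([-1.0, 0.4, -0.7])  # arbitrary real parameters


def loglinear_probs(A, theta):
    """Eq. (1): log p_k = sum_r A[k, r] * theta_r, followed by normalization."""
    p = np.exp(A @ theta)
    return p / p.sum()


def toric_probs(A, zeta):
    """Eq. (2): p_k = prod_r zeta_r ** A[k, r], followed by normalization."""
    p = np.prod(zeta[np.newaxis, :] ** A, axis=1)
    return p / p.sum()


p_log = loglinear_probs(A, theta)
p_tor = toric_probs(A, np.exp(theta))  # zeta_r = exp(theta_r)
```

Both parameterizations return the same probability vector; for this particular `A` the vector also satisfies the independence relation p11 * p22 = p12 * p21.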

The log-linear form of the quasi-symmetry model is

$$\log(p_{i,j}) = \mu + \alpha_i^{(X)} + \beta_j^{(Y)} + \gamma_{i,j} \qquad (3)$$

with the constraints

$$\sum_{i=1}^{I} \alpha_i^{(X)} = 0, \quad \sum_{j=1}^{I} \beta_j^{(Y)} = 0, \quad \gamma_{i,j} = \gamma_{j,i}, \quad i,j = 1,\ldots,I.$$

In Eq. (3), the $\alpha_i^{(X)}$ are the parameters of the row effect, the $\beta_j^{(Y)}$ are the parameters
of the column effect, while the parameters $\gamma_{i,j}$ force the quasi-symmetry. Comparing
Equations (1) and (3) it is easy to explicitly write the model matrix $A_{qs}$ for the
quasi-symmetry model.
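
One possible way to write such a model matrix is sketched below. The function name and the deliberately redundant column layout (intercept, row indicators, column indicators, one indicator per unordered pair of indices) are assumptions made for illustration; since only the column space matters, this matrix generates the same log-linear model as any identified parameterization of Eq. (3).

```python
import numpy as np
from itertools import combinations_with_replacement


def quasi_symmetry_matrix(I):
    """Return an (I*I) x d 0/1 model matrix for quasi-symmetry on an I x I table.

    Columns: intercept, I row indicators, I column indicators, and one
    indicator per unordered pair {i, j} with i <= j (the symmetric term).
    """
    pairs = list(combinations_with_replacement(range(I), 2))
    pair_index = {pair: n for n, pair in enumerate(pairs)}
    d = 1 + 2 * I + len(pairs)
    A = np.zeros((I * I, d), dtype=int)
    for i in range(I):
        for j in range(I):
            k = i * I + j                       # cells in row-major order
            A[k, 0] = 1                         # mu
            A[k, 1 + i] = 1                     # alpha_i (row effect)
            A[k, 1 + I + j] = 1                 # beta_j (column effect)
            pair = (min(i, j), max(i, j))
            A[k, 1 + 2 * I + pair_index[pair]] = 1  # gamma_{i,j} = gamma_{j,i}
    return A


A_qs = quasi_symmetry_matrix(4)
rank = np.linalg.matrix_rank(A_qs)
residual_df = A_qs.shape[0] - rank  # (I - 1)(I - 2) / 2 for an I x I table
```

For a 4 x 4 table the residual degrees of freedom are 3, which agrees with the degrees of freedom quoted for the single-table tests in Sect. 4.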


3 Comparison of several tables under quasi-symmetry 


As outlined in the introduction, in this section we define two log-linear models to
compare two or more square tables under quasi-symmetry. Let us consider $H$ tables
($H \ge 2$) and define a three-way contingency table $T$ by stacking the $H$ tables.
Conversely, each original table is a layer of $T$. Let $K' = HI^2$ be the number of cells
of $T$. The two models are defined as follows. The first model $M_0$ is defined by

$$(M_0) \qquad \log(p_{h,i,j}) = \mu + \mu_h + \alpha_{h,i}^{(X)} + \beta_{h,j}^{(Y)} + \gamma_{h,i,j} \qquad (4)$$

with the constraint $\sum_{h=1}^{H} \mu_h = 0$ in addition to the constraints on $\alpha_{h,i}^{(X)}$, $\beta_{h,j}^{(Y)}$ and
$\gamma_{h,i,j}$ naturally derived from the basic quasi-symmetry model in Eq. (3). The second
model $M_1$ is defined by

$$(M_1) \qquad \log(p_{h,i,j}) = \mu + \mu_h + \alpha_i^{(X)} + \beta_j^{(Y)} + \gamma_{i,j} \qquad (5)$$

with the constraint $\sum_{h=1}^{H} \mu_h = 0$ in addition to the constraints on $\alpha_i^{(X)}$, $\beta_j^{(Y)}$ and $\gamma_{i,j}$
naturally derived from the basic quasi-symmetry model in Eq. (3). It is easy to see
that $M_1 \subset M_0$. Under the model $M_0$ we assume that each layer of the table follows
a quasi-symmetry model with its own parameters, while under the model $M_1$ we
assume that all the layers follow a common quasi-symmetry model. In terms of the
model matrix, the models $M_0$ and $M_1$ have a simple block structure. In fact:
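$$A_{M_0}^t = \begin{pmatrix} A_{qs}^t & & \\ & \ddots & \\ & & A_{qs}^t \end{pmatrix}
\quad \text{and} \quad
A_{M_1}^t = \begin{pmatrix} A_{qs}^t & A_{qs}^t & \cdots & A_{qs}^t \\ 1_K & & & \\ & 1_K & & \\ & & \ddots & \\ & & & 1_K \end{pmatrix}$$

where $1_K$ is a row vector of 1’s with length $K = I^2$, and each empty block means a
block filled with 0’s.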

Let $f$ be the observed table of counts, and write $f$ as a vector of length $K'$
according to the row labels of $A_{M_1}$. The test for nested models with null hypothesis
$H_0 : p \in M_1 \subset M_0$ versus $H_1 : p \in M_0$ can be done using the log-likelihood ratio
statistic

$$G^2 = 2 \sum_{k=1}^{K'} f_k \log\left( \frac{\hat f_{0k}}{\hat f_{1k}} \right)$$

where $\hat f_{0k}$ and $\hat f_{1k}$ are the maximum likelihood estimates of the expected cell counts
under the models $M_0$ and $M_1$ respectively. In the asymptotic theory, the value of $G^2$
must be compared with the quantiles of the chi-square distribution with the appropriate
number of degrees of freedom, depending on the dimensions of the table.
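
As an illustration of how $G^2$ can be computed, the sketch below fits both models by iterative proportional fitting (IPF). This fitting method is an assumption made here, not the procedure described in the paper; it applies because the sufficient statistics of both models are totals over partitions of the cells (layer, row, column and unordered-pair totals). The two 3 x 3 tables and all function names are hypothetical, and the code assumes every observed block total is positive.

```python
import numpy as np


def ipf(counts, partitions, n_iter=2000, tol=1e-10):
    """Fitted counts matching the observed block totals of every partition.

    `partitions` is a list of integer label arrays, one label per cell.
    Assumes all observed block totals are strictly positive.
    """
    counts = np.asarray(counts, dtype=float)
    fitted = np.full_like(counts, counts.sum() / counts.size)
    for _ in range(n_iter):
        previous = fitted.copy()
        for labels in partitions:
            observed = np.bincount(labels, weights=counts)
            current = np.bincount(labels, weights=fitted)
            fitted = fitted * (observed / current)[labels]
        if np.max(np.abs(fitted - previous)) < tol:
            break
    return fitted


def compact(labels):
    """Relabel to 0..n-1 so that no label is left unused."""
    return np.unique(labels, return_inverse=True)[1]


def g2_common_quasi_symmetry(tables):
    """Return (G^2, fitted counts under M0, fitted counts under M1)."""
    tables = np.asarray(tables, dtype=float)
    H, I, _ = tables.shape
    f = tables.ravel()
    h, i, j = np.indices((H, I, I)).reshape(3, -1)
    pair = np.minimum(i, j) * I + np.maximum(i, j)  # unordered pair {i, j}
    # M0: a separate quasi-symmetry model in every layer h.
    m0 = ipf(f, [compact(h * I + i), compact(h * I + j),
                 compact(h * I * I + pair)])
    # M1: layer effect plus one common quasi-symmetry structure.
    m1 = ipf(f, [compact(h), compact(i), compact(j), compact(pair)])
    pos = f > 0
    g2 = 2.0 * np.sum(f[pos] * np.log(m0[pos] / m1[pos]))
    return g2, m0, m1


# Two hypothetical 3 x 3 tables, stacked as layers of a three-way table.
tables = np.array([
    [[20, 5, 3], [8, 15, 6], [4, 9, 30]],
    [[18, 7, 2], [5, 12, 9], [6, 4, 25]],
])
g2, m0, m1 = g2_common_quasi_symmetry(tables)
```

When all layers are identical, the two fits coincide and the statistic is zero up to the convergence tolerance, which gives a quick sanity check of the implementation.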

We introduce here a procedure for exact inference via Markov bases and the D-S
algorithm, see [5] for details. Given the observed table $f$, the key idea of the D-S
algorithm is to make the reference set of a given table

$$\mathcal{F}(f) = \left\{ f' \in \mathbb{N}^{K'} \,:\, A_{M_1}^t f' = A_{M_1}^t f \right\}$$

connected. This is done through a set of moves $\mathcal{M}$, i.e., integer-valued tables with
null value of the sufficient statistic $A_{M_1}^t$. If the set $\mathcal{M}$ contains enough moves
such that each table of $\mathcal{F}(f)$ can be reached from $f$ in a finite number of steps by
adding/subtracting moves, we say that $\mathcal{M}$ is a Markov basis. Once a Markov basis
for the model is available, the D-S algorithm is an MCMC algorithm and proceeds as
follows. At each step:

1. let $f$ be the current table;

2. choose with uniform probability a move $m \in \mathcal{M}$ and a sign $\epsilon = \pm 1$ with
   probability 1/2 each;

3. define the candidate table as $f_+ = f + \epsilon m$;

4. generate a random number $u$ with uniform distribution over $[0, 1]$. If $f_+ \ge 0$ and

   $$u < \min\left\{ 1, \frac{\mathcal{H}(f_+)}{\mathcal{H}(f)} \right\}$$

   then move the chain to $f_+$; otherwise stay at $f$. Here $\mathcal{H}$ denotes the
   hypergeometric distribution on $\mathcal{F}(f)$.

After an appropriate burn-in period, and taking only tables at fixed times to reduce
correlation between the sampled tables, the proportion of sampled tables with test
statistics greater than or equal to the test statistic of the observed one is the Monte
Carlo approximation of the p-value of the log-likelihood ratio test. The results in the
next section are based on Monte Carlo samples of size 10,000.
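
The four steps above can be sketched in a few lines. The example is not the paper's computation: it assumes the much simpler independence model on a hypothetical 2 x 2 table, whose Markov basis consists of a single move, and it uses the Pearson statistic instead of $G^2$. The function names, burn-in, thinning and sample size are illustrative choices. The hypergeometric weight is used in its unnormalized form, proportional to $1/\prod_k f_k!$.

```python
import math
import numpy as np


def log_h(table):
    """Log of the unnormalized hypergeometric weight, H(f) ~ 1 / prod_k f_k!."""
    return -sum(math.lgamma(x + 1.0) for x in table)


def ds_step(f, moves, rng):
    """One step of the D-S chain (steps 1-4 of the algorithm)."""
    m = moves[rng.integers(len(moves))]      # step 2: uniform choice of a move
    eps = 1 if rng.random() < 0.5 else -1    # step 2: random sign
    candidate = f + eps * m                  # step 3: candidate table
    if np.any(candidate < 0):
        return f                             # outside the reference set
    log_ratio = log_h(candidate) - log_h(f)
    u = rng.random()                         # step 4: accept or stay
    if log_ratio >= 0 or u < math.exp(log_ratio):
        return candidate
    return f


def mc_p_value(f_obs, moves, statistic, n_samples=1000, burn_in=500,
               thin=10, seed=0):
    """Monte Carlo p-value: share of sampled tables at least as extreme."""
    rng = np.random.default_rng(seed)
    f = np.array(f_obs, dtype=int)
    t_obs = statistic(f)
    for _ in range(burn_in):
        f = ds_step(f, moves, rng)
    hits = 0
    for _ in range(n_samples):
        for _ in range(thin):
            f = ds_step(f, moves, rng)
        hits += statistic(f) >= t_obs - 1e-9
    return hits / n_samples


def pearson_2x2(f):
    """Pearson chi-square statistic for independence in a flattened 2x2 table."""
    t = f.reshape(2, 2).astype(float)
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    return float(np.sum((t - expected) ** 2 / expected))


moves = [np.array([1, -1, -1, 1])]           # the single basic move
f_obs = np.array([8, 2, 1, 9])               # hypothetical observed table
p_value = mc_p_value(f_obs, moves, pearson_2x2)
```

For the models of this section the same loop would apply, with `moves` replaced by the Markov basis of $A_{M_1}$ (computed, for instance, with dedicated software) and `statistic` replaced by $G^2$.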


4 Example 


As a simple numerical example we consider the data reported in Tab. 1 (adapted
from [4] and originally collected during the “Indagine Longitudinale sulle Famiglie
Italiane” (Italian Household Longitudinal Survey)), where the inter-generational
social mobility has been recorded on a sample of 4,343 Italian workers in 1997. The
data take into account the gender, and thus we have separate tables for men and
women. There are 4 categories of workers. A: “High level professionals”; B:
“Employees and commerce”; C: “Skilled working class and artisans”; D: “Unskilled
working class”. In [4] these data are analyzed extensively, with a thorough presentation
of many models for special patterns of mobility. Here we merely use
the simplified version displayed in Tab. 1 to show the practical applicability of the
methodology introduced in Sect. 3.

The relevant Markov basis has been computed with the software 4ti2 [12] and
it consists of 151 moves. If we consider the two tables separately, the exact
p-values computed through the D-S algorithm are equal to 0.051 and 0.088
respectively ($G^2 = 6.703$ and $G^2 = 8.279$ respectively, with 3 df). Running the test
described above to test a unique quasi-symmetry model, the D-S test produces a
p-value equal to 0 ($G^2 = 112.687$ with 13 df), meaning that there is a strong departure
from the null hypothesis. Combining these results, one can conclude that the two
tables have strong differences in terms of patterns of mobility.


Table 1 Table of social mobility in Italy (1997). Columns represent the father’s occupation, rows
represent the son’s (or daughter’s) occupation. Male respondents in the left panel, female
respondents in the right panel.

            Men                          Women
       A    B    C    D             A    B    C    D
  A  172   31   31   28        A  137   52   29   15
  B  108   49   24   46        B   78   46   14   23
  C  174   84  301  272        C  142  100  124  145
  D  225  148  236  664        D  164  181  141  359


Among the future directions of this research we mention: (a) the theoretical
characterization of the relevant Markov bases in order to apply our technique also to
large tables; (b) the use of this technique to make inference on other measures of
mobility based on log-linear models, see [13] and [4] for an introductory overview
of these measures with several examples from surveys in European nations.


Testing Beta-Pricing Models 
Using Large Cross-Sections 


Test per modelli di pricing multifattoriali utilizzando 
cross-sections di grandi dimensioni 


Valentina Raponi, Cesare Robotti and Paolo Zaffaroni 


Abstract Building on the Shanken (1992) estimator, we develop a new methodology 
for estimating and testing beta-pricing models when a large number of assets N is 
available but the number of time-series observations is small. We show empirically 
that our large N framework poses a serious challenge to common empirical findings 
regarding estimated risk premia and validity of beta-pricing models. We generalize 
our theoretical results to the more realistic case of unbalanced panels. The practical 
relevance of our findings is confirmed via Monte Carlo simulations. 

Abstract Starting from the Shanken (1992) estimator, the paper introduces a new
methodology for estimating and testing market models when the number of assets
N available for the analysis is very large but the time dimension T is small.
From an empirical standpoint, we show that this new methodology can deliver
results that differ markedly from those usually obtained with the standard
methodology. The theoretical results, in terms of both estimation and testing, are
generalized to the more realistic case of unbalanced panels. The relevance of the
theoretical results is also confirmed by a Monte Carlo simulation exercise.


Key words: Beta-pricing models; Ex-post risk premia; Two-pass cross-sectional 
regression; Large N asymptotics; Specification test; Unbalanced panel. 


1 Introduction 


Tens of thousands of stocks are traded every day in financial markets, providing an
extremely rich information set to validate and estimate asset pricing models. At the
same time, both academics and practitioners could be reluctant to use time series
spanning long time periods, to avoid the risk of including structural breaks and to
avoid the additional difficulty of parameterizing time variation of betas and risk
premia.

Therefore, it is important to have a methodology that allows for carrying out 
statistically correct inference on risk premia and testing the validity of beta-pricing 
models, by exploring the large available cross-sectional variation of returns across 
N individual securities and, at the same time, relying on a limited number of time- 
series observations, T. Our main contribution in this paper is to develop such a 
formal methodology, built on the large-N estimator proposed by Shanken (1992). 

Theoretically, we provide a rigorous statistical analysis of the Shanken (1992)
estimator of the ex-post risk premia and, more generally, provide a formal methodology
for estimation and testing of beta-pricing models in a large-N environment.
To provide further motivation for our analysis, we show that the Shanken estimator is
an element, with a special property, of a class of OLS bias-adjusted estimators of
ex-post risk premia. In particular, we demonstrate mathematically that it is the only
element of this class not requiring a preliminary estimation of the bias adjustment. This
is a particularly convenient feature because it avoids, for example, any pre-testing
biases and, at the same time, it does not require sacrificing data for preliminary
estimation. We then focus on the asymptotic properties of the Shanken estimator for
large N and fixed T: under mild, easily verifiable assumptions, in particular
permitting a degree of cross-correlation among returns, we establish $\sqrt{N}$-consistency
and asymptotic normality. Moreover, we derive an explicit, and easy to interpret,
expression for its asymptotic covariance matrix, showing how it can be consistently
estimated and used to conduct inference on the risk premia estimates.

In addition to estimation, we provide a new test for the validity of the asset-pricing
restrictions and characterize its distribution for large N and fixed T, under
the null hypothesis that the model is correctly specified. Notably, our test has
power, that is, it is able to discriminate whether the beta-pricing model is correctly
specified, despite being built on the ex-post pricing errors, which are, necessarily,
contaminated by the unexpected factor outcomes.

To further motivate the importance of our methodology, we also explore the
finite-N properties of the Shanken estimator and of our test statistic via Monte Carlo
experiments, and compare them with the properties of traditional methodologies.
Our simulations highlight how, when N is much larger than T, inference based
on the traditional t-statistics can be severely misleading, even when accounting for
the correct large-T standard errors. In contrast, the t-statistics of the Shanken
estimator, based on our large-N standard errors, are correctly sized.

We demonstrate the usefulness of our methodology by means of an empirical
analysis that employs individual monthly stock returns from the CRSP database
over overlapping three-, six- and ten-year periods from 1966 until 2013. The three
prominent beta-pricing specifications that we consider are the Capital Asset Pricing
Model (CAPM) of Sharpe (1964) and Lintner (1965), the three-factor Fama
and French (1993) model, and the recently proposed five-factor Fama and French
(2015) model. We find significant pricing ability for all the factors, for most periods,
for each of the three models, even when using a relatively short time window of
three years. In contrast, the same risk premia appear insignificantly different from
zero when using the traditional approach of estimating risk premia based on large-T
asymptotics. In terms of asset pricing tests, our methodology tends to reject the
CAPM even when using a short time window, in contrast to the traditional large-T
approach of Gibbons, Ross and Shanken (1989).

Although computationally appealing, it remains to verify whether the Shanken
estimator $\hat\Gamma^*$ exhibits desirable (asymptotic) statistical properties. This is studied in
the next Section, where we provide a formal asymptotic analysis of $\hat\Gamma^*$.


2 Asymptotic Analysis 


The analysis in this section assumes that $N \to \infty$ and $T$ is fixed. We first establish the
limiting distribution of the Shanken bias-adjusted estimator $\hat\Gamma^*$ and explain how its
asymptotic covariance matrix can be consistently estimated. We then characterize
the limiting behavior of our test $\mathcal{S}^*$ of the asset-pricing restriction.


2.1 Asymptotic Distribution of the Shanken Estimator 


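In this subsection, we study the asymptotic distribution of $\hat\Gamma^*$, under the assumption
that the model is correctly specified, namely that exact no-arbitrage holds
(Assumption 4 in Appendix A).

Let
$\Sigma_X = \begin{bmatrix} 1 & \mu_\beta' \\ \mu_\beta & \Sigma_\beta \end{bmatrix}$,
$\sigma^2 = \lim \frac{1}{N} \sum_{i=1}^{N} \sigma_i^2$,
$U_\epsilon = \lim \frac{1}{N} \sum_{i,j=1}^{N} E\left[ \mathrm{vec}(\epsilon_i \epsilon_i' - \sigma_i^2 I_T)\, \mathrm{vec}(\epsilon_j \epsilon_j' - \sigma_j^2 I_T)' \right]$,
$M = I_T - D(D'D)^{-1}D'$, where $I_T$ is a $T \times T$ identity matrix, $D = [1_T, F]$,
$Q = \frac{1_T}{T} - P\gamma_1^P$, and
$Z = (Q \otimes P) + \frac{\mathrm{vec}(M)}{T-K-1}\, \gamma_1^{P\prime} P'P$,
where all the limits are finite by our assumptions, as $N \to \infty$. In the following
theorem, we provide the rate of convergence and the limiting distribution of $\hat\Gamma^*$.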


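Theorem 1.

(i) Under Assumptions 1-5 (listed in Appendix A),

$$\hat\Gamma^* - \Gamma^P = O_p\left( \frac{1}{\sqrt{N}} \right).$$

(ii) Under Assumptions 1-6 (listed in Appendix A),

$$\sqrt{N}\left( \hat\Gamma^* - \Gamma^P \right) \xrightarrow{d} \mathcal{N}\left( 0_{K+1},\; V + \Sigma_X^{-1} W \Sigma_X^{-1} \right),$$

where
$V = \frac{\sigma^2}{T} \left[ 1 + \gamma_1^{P\prime} (\tilde F'\tilde F/T)^{-1} \gamma_1^P \right] \Sigma_X^{-1}$
and
$W = \begin{bmatrix} 0 & 0_K' \\ 0_K & Z' U_\epsilon Z \end{bmatrix}$.

Proof: See Appendix C and Lemmas 1 to 5 in Appendix B. 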
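To conduct statistical inference, we need a consistent estimator of the asymptotic
covariance matrix $V + \Sigma_X^{-1} W \Sigma_X^{-1}$. Let $M^{(2)} = M \odot M$, where $\odot$ denotes the
Hadamard product operator. In addition, define

$$\hat\sigma_4 = \frac{\frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{N} \hat\epsilon_{it}^4}{3\, \mathrm{tr}(M^{(2)})}$$

and let

$$\hat Z = (\hat Q \otimes P) + \frac{\mathrm{vec}(M)}{T-K-1}\, \hat\gamma_1^{*\prime} P'P, \qquad (7)$$

with

$$\hat Q = \frac{1_T}{T} - P\hat\gamma_1^*. \qquad (8)$$

The following theorem provides a consistent estimator of the asymptotic covariance
matrix of the $\hat\Gamma^*$ estimator.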


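Theorem 2. Under Assumptions 1-5 (listed in Appendix A), we have

$$\hat V + (\hat\Sigma_X - \hat\Lambda)^{-1}\, \hat W\, (\hat\Sigma_X - \hat\Lambda)^{-1} \xrightarrow{p} V + \Sigma_X^{-1} W \Sigma_X^{-1}, \qquad (9)$$

where

$$\hat V = \frac{\hat\sigma^2}{T} \left[ 1 + \hat\gamma_1^{*\prime} (\tilde F'\tilde F/T)^{-1} \hat\gamma_1^* \right] (\hat\Sigma_X - \hat\Lambda)^{-1}, \qquad (10)$$

$$\hat W = \begin{bmatrix} 0 & 0_K' \\ 0_K & \hat Z' \hat U_\epsilon \hat Z \end{bmatrix}, \qquad (11)$$

and $\hat U_\epsilon$ is a consistent plug-in estimator of $U_\epsilon$ described in Appendix D.


Proof: See Appendix C and Lemmas 1 to 6 in Appendix B. 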


2.2 Limiting Distribution of the Specification Test 


The null hypothesis underlying the asset-pricing restriction can be formulated as

$$H_0 : e_i = 0 \quad \text{for every } i = 1, 2, \ldots,$$

where $e_i = E[R_{it}] - \gamma_0 - \beta_i'\gamma_1$ is the pricing error associated with asset $i$. The null
hypothesis $H_0$ easily follows by simply rewriting Assumption 4. Let $X_i = [1, \beta_i']$,
$\hat X_i = [1, \hat\beta_i']$, and denote by $\hat e_i^P$ the ex-post sample pricing error for asset $i$. Then, we
have

$$\hat e_i^P = \bar R_i - \hat X_i \hat\Gamma^*. \qquad (12)$$


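Since we estimate $\Gamma^P$ via OLS cross-sectional regressions, we propose a test based
on the sum of the squared ex-post sample pricing errors, that is,

$$\hat{\mathcal{Q}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat e_i^P \right)^2. \qquad (13)$$

Consider the centered statistic

$$\mathcal{S}^* = \sqrt{N} \left( \hat{\mathcal{Q}} - \frac{\hat\sigma^2}{T} \left( 1 + \hat\gamma_1^{*\prime} (\tilde F'\tilde F/T)^{-1} \hat\gamma_1^* \right) \right). \qquad (14)$$

The following theorem provides the limiting distribution of $\mathcal{S}^*$ under $H_0 : e_i = 0$ for
all $i$.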

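Theorem 3. Under Assumptions 1-6 (listed in Appendix A), implying that $H_0 : e_i = 0$
holds for all $i$, we have

$$\mathcal{S}^* \xrightarrow{d} \mathcal{N}(0, \mathcal{V}), \qquad (15)$$

where $\mathcal{V} = Z_Q' U_\epsilon Z_Q$ and $Z_Q = (Q \otimes Q) - \frac{\mathrm{vec}(M)}{T-K-1}\, Q'Q$.


Proof: See Appendix C and Lemmas 1 to 5 in Appendix B. 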


3 Empirical Analysis 


In this section, we empirically estimate the risk premia associated with some prominent
beta-pricing models, using individual stock return data, and investigate their
performance. This demonstrates how the empirical results obtained using our large-N
methodology, illustrated in the previous section, can differ, even dramatically,
from the results obtained with the more traditional large-T methodologies. We consider
three linear beta-pricing models: (i) the single-factor CAPM, (ii) the three-factor
model of Fama and French (1993, FF3), and (iii) the five-factor model recently
proposed by Fama and French (2015, FF5). The data on the above factors are
available from Kenneth French’s website. We use monthly data on individual
stocks from the CRSP database, available from January 1966 to December 2013.
We carry out the empirical analysis using balanced panels with three different time
windows of three, six and ten years (i.e., T = 36, 72, and 120, respectively).
For each of these time windows, we estimate each of the above beta-pricing
models by rolling the window one month at a time. In this way, we obtain time
series of estimated risk premia and of the test statistic based on overlapping time
windows of fixed length T.

We document a sizeable difference between the results of our large-N approach
and the results of the conventional large-T approaches. This outcome is a combination
of two elements: the extremely small standard errors associated with a very
large N, and the bias correction of the Shanken estimator, which leads to an increase,
on average of about 20% but sometimes up to 50%, of the risk premia estimates
over the OLS estimator.
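
For reference, the OLS benchmark mentioned above can be sketched as a plain two-pass cross-sectional regression. The code below is an illustration on simulated data, not the authors' implementation: it does not apply the Shanken bias adjustment, whose formula is not reproduced in this text, and the function name, simulated parameter values and the ex-post premium used for comparison (assumed here to be the ex-ante premium plus the sample deviation of the factor from its mean) are all assumptions.

```python
import numpy as np


def two_pass_ols(returns, factors):
    """Plain two-pass OLS cross-sectional regression (no bias adjustment).

    returns: T x N array of asset returns; factors: T x K array of factors.
    Returns (gamma_hat, beta_hat); gamma_hat has length K + 1 (zero-beta
    rate followed by the K estimated risk premia).
    """
    T, N = returns.shape
    D = np.column_stack([np.ones(T), factors])           # first-pass regressors
    coef, *_ = np.linalg.lstsq(D, returns, rcond=None)   # (K + 1) x N
    beta_hat = coef[1:].T                                # N x K estimated betas
    X_hat = np.column_stack([np.ones(N), beta_hat])      # second-pass regressors
    mean_returns = returns.mean(axis=0)
    gamma_hat, *_ = np.linalg.lstsq(X_hat, mean_returns, rcond=None)
    return gamma_hat, beta_hat


# Simulated single-factor example with a short window and a large cross-section.
rng = np.random.default_rng(0)
T, N = 36, 2000
factor_mean = 0.005
f = rng.normal(factor_mean, 0.04, size=(T, 1))
beta = rng.uniform(0.5, 1.5, size=(N, 1))
gamma0, gamma1 = 0.002, 0.006
eps = rng.normal(0.0, 0.08, size=(T, N))
signal = gamma0 + (gamma1 + f - factor_mean) @ beta.T    # T x N, noise-free part
R = signal + eps

gamma_ols, beta_ols = two_pass_ols(R, f)
gamma1_ex_post = gamma1 + f.mean() - factor_mean
```

With noisy first-pass betas and fixed T, the OLS slope `gamma_ols[1]` is typically pulled toward zero relative to `gamma1_ex_post`; this errors-in-variables effect is what a bias-adjusted estimator is designed to remove. Without noise, the two-pass procedure recovers the ex-post quantities exactly.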

We finally consider the performance of our specification test $\mathcal{S}^*$. The result is,
again, rather striking: in contrast to our large-N test, the GRS test is almost always
unable to reject the CAPM at the 5% level when considering the shortest time window of
three years of data (T = 36). With the GRS test, the CAPM is rejected about half of
the time for the six-year window and is almost always rejected for the long time
window of ten years.


4 Conclusion 


This paper is concerned with estimation of risk premia and testing of beta-pricing 
models when data is available for a large cross-section of securities N but only for a 
limited number of time periods. Because in this context the CSR OLS estimator of 
the risk premia is asymptotically biased and inconsistent, the focus of the paper is on 
the bias-adjusted estimator of the ex-post risk premia proposed by Shanken (1992). 
In terms of estimation, we demonstrate that the Shanken estimator exhibits desirable
properties, such as √N-consistency and asymptotic normality, as N diverges.
In terms of testing, we propose a new test of the no-arbitrage asset pricing restric- 
tion and establish its asymptotic distribution (assuming that the restriction holds) 
as N diverges. Finally, we show how our results can be extended to deal with the 
more realistic case of unbalanced panels, allowing us to take advantage of the large 
cross-sections of stocks existing only for certain time periods. Monte Carlo simulations
corroborate our theoretical findings, both in terms of estimation and in terms of
testing for the asset pricing restriction.

The usefulness of our methodology is demonstrated by means of an empirical
analysis that employs individual monthly stock returns from the CRSP database. We
find some convincing evidence of pricing ability for all the factors, to varying
degrees, for each of the three models and for most periods, even when using a
relatively short time window of three years.


On the use of predictive methods for ship fuel
consumption analysis from massive on-board 
operational data 


Metodi di analisi predittiva dei consumi di una nave 
mediante dati massivi di navigazione 


Marco Seabra dos Reis, Biagio Palumbo, Antonio Lepore, Ricardo Rendall,
Christian Capezza 


Abstract Measuring, reporting and verification of ship fuel consumption are the 
main requirements imposed by upcoming European regulations. However, the 
massive amount of navigation data resulting from ship computerization is not easily 
handled by shipping operators because of the lack of standardized solutions. In this 
context, modern statistical and machine learning techniques provide effective 
methods to exploit the massive operational data available on modern ships and, in 
particular, can be used for building predictive models to estimate fuel consumption. 
Using real operational data collected from a Ro-Pax cruise ship owned by
the Italian shipping company Grimaldi Group, this paper presents an extensive 
comparison study of modern predictive analytical methods (e.g. variable selection, 
penalized regression, latent variable methods and tree-based ensembles) in order to 
explore new directions in the analysis of ship fuel consumption. 

Abstract Le nuove regolamentazioni europee impongono, tra i principali 
requisiti, il monitoraggio, la documentazione e la verifica del consumo di 
carburante delle navi. L’enorme quantità di dati di navigazione disponibile
mediante i moderni sistemi di acquisizione installati a bordo delle navi non è del
tutto utilizzata dalle compagnie armatoriali per la mancanza di opportune tecniche 
di analisi. In tale scenario, è sempre più evidente la necessità di esplorare tecniche 
statistiche e di apprendimento automatico al fine di poter adeguatamente 
interpretare tali dati. Il presente lavoro propone un confronto critico dei metodi di 
analisi predittiva (e.g., metodi di selezione delle variabili, regressione penalizzata, 
analisi delle variabili latenti e metodi ensemble) mediante dati di navigazione reali 
acquisiti a bordo di una nave da carico e passeggeri, di proprietà della società 
armatoriale italiana Grimaldi Group S.p.A.

Key words: fuel consumption prediction, variable selection method, penalized
regression, latent variable methods, tree-based ensembles.


1 Introduction


In recent decades, increasing emissions of greenhouse gases by maritime transport
have urged the European Commission to implement a new regulatory regime,
requiring shipping companies to adopt procedures for the measuring, reporting, and
verification of CO2 emissions. This is achieved by analyzing fuel consumption data,
which are directly related to CO2 emissions. In this regard, the marine engineering
literature mainly relies on empirical curves that quantify fuel consumption as a 
function only of the vessel’s speed over ground. These curves are based on dedicated 
experiments where external/noise factors can be controlled and set to standard 
conditions [8], but they are not applicable in real environments where a large number 
of other variables also influence fuel consumption. On the other hand, modern ships 
can record a great amount of multi-sensor operational data through on-board 
automatic acquisition systems. These data can be analyzed using statistical and 
machine learning techniques in order to develop new solutions to the fuel prediction 
challenge. Thus, in this paper, modern predictive analytics based on four classes of 
methods (variable selection, penalized regression, latent variable methods and tree- 
based ensembles) are compared and discussed in order to explore new directions in 
the analysis of ship fuel consumption based on operational data (i.e. under non- 
standard conditions). 


2 Data Description and Comparison Framework 


Operational data were collected from a Ro-Pax ship owned by the Italian shipping 
company Grimaldi Group. The data cover one year’s worth of relevant observations;
for confidentiality reasons, the ship’s name, route and voyage dates are
intentionally omitted. The response variable is the fuel consumption per hour
(FCPH) for each voyage. Table 1 shows the variables used to describe the ship’s 
operating conditions, which also serve as predictor variables in order to estimate 
FCPH. Further details can be found in [2]. 

In order to explore the operational dataset, the statistical and machine learning 
literature offers a diverse set of predictive methods that can be applied to predict fuel 
consumption based on data collected during the ship’s voyage. In this paper, the 
following classes of methods are investigated: variable selection, penalized 
regression, latent variable and tree-based ensemble methods. Note that multiple 
linear regression (MLR), one of the most tested and studied methods, is not 
considered because it does not cope with some characteristics of the dataset, namely 
the high collinearity among some predictors, which leads to unstable estimations of 
the regression coefficients and poor prediction intervals [4]. In this scenario, 
alternative methods are preferred to overcome the limitations of MLR. 

Variable selection methods rest on the assumption that only some variables have
relevant predictive power, while the others can be discarded [1]. Forward stepwise
regression (FSR) is selected as the representative method of this class. It is
implemented by a sequential algorithm that evaluates, at each step, the predictive
power of incrementally larger and smaller models through appropriate statistics. The
statistic used to include or discard variables is the p-value of a partial F-test:
variables with p-values smaller than a specified threshold (p_in) are included in the
model, while variables already in the model but with p-values larger than a tolerance
(p_out) are removed.
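
As a minimal sketch of the selection loop described above, the following Python fragment alternates an entry step and a removal step. The partial F-test itself is hidden behind a hypothetical callable, partial_f_pvalue(variable, model), which is assumed to return the p-value of the variable given the other variables in the model. The variable names and p-values of the toy function are invented for illustration and are not taken from the ship dataset.

```python
def forward_stepwise(candidates, partial_f_pvalue, p_in=0.05, p_out=0.10,
                     max_steps=100):
    """Stepwise selection driven by partial F-test p-values.

    partial_f_pvalue(variable, model) is a hypothetical callable returning
    the p-value of `variable` given the other variables listed in `model`.
    """
    selected = []
    for _ in range(max_steps):  # guard against endless add/remove cycles
        changed = False

        # Entry step: add the most significant candidate if p < p_in.
        remaining = [v for v in candidates if v not in selected]
        if remaining:
            pvals = {v: partial_f_pvalue(v, selected) for v in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] < p_in:
                selected.append(best)
                changed = True

        # Removal step: drop the least significant variable if p > p_out.
        if selected:
            pvals = {v: partial_f_pvalue(v, [u for u in selected if u != v])
                     for v in selected}
            worst = max(pvals, key=pvals.get)
            if pvals[worst] > p_out:
                selected.remove(worst)
                changed = True

        if not changed:
            break
    return selected


def toy_pvalue(variable, model):
    """Invented p-values: 'sg_port' is useful alone but redundant once
    'speed' is in the model (a stand-in for collinearity)."""
    if variable == "sg_port":
        return 0.4 if "speed" in model else 0.0005
    return {"speed": 0.001, "draught": 0.01, "wind": 0.5}[variable]


chosen = forward_stepwise(["sg_port", "speed", "draught", "wind"], toy_pvalue)
print(chosen)
```

In the toy run, sg_port enters first, is removed after speed enters, and the loop stops with speed and draught in the model.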

Penalized regression methods introduce a penalty on the magnitude of the
regression coefficients to shrink and stabilize them. This increases model bias
but reduces the estimator variance [7]. In this class, four methods were considered:
support vector regression (SVR), elastic net (EN), ridge regression (RR) and least
absolute shrinkage and selection operator (LASSO). SVR minimizes the sum of
squared regression coefficients to reduce model variance while constraining the
prediction error to be below a threshold ε, although slack variables are used to allow
some errors to be above ε [11]. The EN model is obtained by solving the following
optimization problem:


$$\hat{\beta}_{EN} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \gamma \left[ \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right] \right\}$$

The two hyper-parameters are γ, which controls the bias-variance tradeoff, and α, which
weights the squared (β²) and the absolute-value (|β|) penalties. RR is obtained by setting
α = 0, while LASSO is obtained by setting α = 1.
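
The role of the two hyper-parameters can be illustrated with a small Python function that evaluates an elastic net objective for given coefficients. This is a sketch under stated assumptions: the loss is taken as the plain residual sum of squares and the squared penalty carries a factor (1 - α)/2, which is one common convention and may differ from the scaling used in the study. The function name and the toy numbers are hypothetical.

```python
def elastic_net_objective(beta0, beta, X, y, gamma, alpha):
    """Residual sum of squares plus the elastic net penalty.

    Assumed form: RSS + gamma * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|)).
    alpha = 0 leaves only the ridge penalty, alpha = 1 only the LASSO penalty.
    """
    rss = sum((yi - beta0 - sum(b * x for b, x in zip(beta, xi))) ** 2
              for xi, yi in zip(X, y))
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return rss + gamma * ((1 - alpha) / 2 * l2 + alpha * l1)


# Toy data (invented): two observations, two predictors.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
beta = [0.5, -0.5]

ridge_value = elastic_net_objective(0.0, beta, X, y, gamma=2.0, alpha=0.0)
lasso_value = elastic_net_objective(0.0, beta, X, y, gamma=2.0, alpha=1.0)
mixed_value = elastic_net_objective(0.0, beta, X, y, gamma=2.0, alpha=0.5)
print(ridge_value, lasso_value, mixed_value)
```

For intermediate values of α the penalty is a weighted mix of the two, so the objective lies between the ridge and LASSO values for the same coefficients.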

Latent variable methods are based on the assumption that the variability shared by 
predictors and response variables can be explained by a set of unmeasured 
quantities, called latent variables, estimated as linear combinations of the measured 
variables and used to predict the response. In this class, principal component 
regression (PCR), principal component regression with the scores added in a forward 
stepwise fashion (PCR_FS), and PLS regression are considered. PCR [9] 
decomposes the predictor space using principal component analysis (PCA), since
most of its variability can often be explained by a number of principal components
(a_PCR) smaller than the number of predictors. Then, the principal components are
used as predictors of the response variable and MLR is used to estimate their
regression coefficients. In PCR_FS, the principal components are selected using the
forward stepwise algorithm based on the p-value of the partial F-test, following an
iterative process similar to FSR. PLS regression [10] chooses a_PLS latent variables
that maximize the covariance between predictors and response variable.

Tree-based ensemble methods [3] iteratively split the predictors’ space into smaller
regions, reducing the response variability. In this work, trees are grown until a
minimum of five samples is left in each region, and three methods were
considered: bagging of regression trees (BaRTs), random forests (RFs) and boosting
of regression trees (BoRTs). BaRTs and RFs use the bootstrap to generate
many datasets, which are used to build regression trees. The predicted response is
the average of the predictions from all trees in the ensemble. The numbers of trees,
T_BaRT and T_RF, are the hyper-parameters that control the bias-variance tradeoff for
BaRT and RF, respectively. The BoRT method [5], in turn, exploits the residuals from
previous trees to fit new regression trees. Its parameters are the learning rate μ (fixed
at μ = 0.02) and the number of trees T_BoRT.
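
The idea of fitting each new tree to the residuals of the previous ones can be sketched as follows. To keep the example short, each "tree" is a single-split stump on one predictor, which is a simplification of the regression trees used in the study; the minimum of five samples per region and the learning rate of 0.02 follow the text, while the function names and the toy data are hypothetical.

```python
def fit_stump(x, residuals, min_leaf=5):
    """Best single split of a 1-D predictor with at least min_leaf samples
    on each side; returns (threshold, left_mean, right_mean)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), float("inf"), overall, overall)  # no-split fallback
    for k in range(min_leaf, len(x) - min_leaf + 1):
        if x[order[k - 1]] == x[order[k]]:
            continue  # cannot split between tied values
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - mean_l) ** 2 for r in left)
               + sum((r - mean_r) ** 2 for r in right))
        if sse < best[0]:
            threshold = (x[order[k - 1]] + x[order[k]]) / 2
            best = (sse, threshold, mean_l, mean_r)
    return best[1:]


def boost(x, y, n_trees=500, learning_rate=0.02):
    """Boosting for squared error: each stump is fitted to current residuals."""
    baseline = sum(y) / len(y)
    prediction = [baseline] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        threshold, left, right = fit_stump(x, residuals)
        stumps.append((threshold, left, right))
        prediction = [p + learning_rate * (left if xi <= threshold else right)
                      for xi, p in zip(x, prediction)]
    return baseline, learning_rate, stumps


def predict(model, xi):
    baseline, learning_rate, stumps = model
    return baseline + learning_rate * sum(
        left if xi <= threshold else right for threshold, left, right in stumps)


# Toy data (invented): a smooth nonlinear relation.
x = [float(i) for i in range(20)]
y = [0.1 * xi * xi for xi in x]
model = boost(x, y, n_trees=200)
mean_y = sum(y) / len(y)
mse_baseline = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boosted = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_baseline, mse_boosted)
```

A small learning rate means that each stump contributes only a fraction of its fit, which is why the number of trees becomes the main quantity to tune.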

The prediction performance of each regression technique is evaluated through
the root mean squared error of double cross-validation [6], RMSE_dcv. The dataset is
randomly partitioned into a training and a test set: the former (80% of the data) is
used to select the model hyper-parameters (Table 2), whereas the latter (20% of the
data) is used to assess prediction performance and to compute RMSE_dcv.
Since this measure is affected by the initial split of the dataset, the procedure is
iterated 40 times in order to obtain a measure of variability. Moreover, 10-fold cross-
validation is adopted for selecting hyper-parameters during model training, and
variables are transformed to zero mean and unit variance.
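
The evaluation scheme can be sketched in plain Python as an outer loop of random 80/20 splits with an inner 10-fold cross-validation for hyper-parameter selection. The sketch makes several assumptions that are not in the text: a ridge regression with a single predictor stands in for the methods compared, the data are synthetic, the standardization is estimated on the training portion only, and all function names are hypothetical.

```python
import math
import random


def fit_ridge(x, y, gamma):
    """Univariate ridge fit on a standardized predictor; returns a predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n) or 1.0
    z = [(v - mx) / sx for v in x]
    slope = (sum(zi * (yi - my) for zi, yi in zip(z, y))
             / (sum(zi * zi for zi in z) + gamma))
    return lambda v: my + slope * (v - mx) / sx


def rmse(model, x, y):
    return math.sqrt(sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y))


def select_gamma(x, y, grid, n_folds=10):
    """Inner loop: gamma with the lowest mean RMSE over n_folds folds.
    Folds are interleaved because the outer loop already shuffles the data."""
    folds = [list(range(i, len(x), n_folds)) for i in range(n_folds)]

    def cv_error(gamma):
        errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(x)) if i not in held]
            model = fit_ridge([x[i] for i in train], [y[i] for i in train], gamma)
            errors.append(rmse(model, [x[i] for i in held_out],
                               [y[i] for i in held_out]))
        return sum(errors) / len(errors)

    return min(grid, key=cv_error)


def double_cross_validation(x, y, grid, n_iterations=40, test_share=0.2, seed=0):
    """Outer loop: repeated random splits; returns one test RMSE per iteration."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        n_test = int(round(test_share * len(x)))
        test, train = idx[:n_test], idx[n_test:]
        x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
        gamma = select_gamma(x_tr, y_tr, grid)
        model = fit_ridge(x_tr, y_tr, gamma)
        scores.append(rmse(model, [x[i] for i in test], [y[i] for i in test]))
    return scores


# Synthetic data (invented): a noisy linear relation.
data_rng = random.Random(1)
x = [data_rng.uniform(10, 25) for _ in range(100)]
y = [0.5 * xi + data_rng.gauss(0, 1) for xi in x]
scores = double_cross_validation(x, y, grid=[0.002, 0.02, 0.2, 2, 20])
print(len(scores), min(scores), max(scores))
```

The spread of the 40 values is what allows the methods to be compared on their distribution of errors rather than on a single split.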


Table 1: Operational variables measured for each voyage.

No.  Variable  Description
1    SG        Shaft generator power (port) [kW]
2    SG        Shaft generator power (starboard) [kW]
3    ΔP        Power difference between port and starboard propeller shafts [kW]
4    ΔSG       Power difference between two shaft generators [kW]
5    V         Speed over ground (SOG) [kn]
6    W         Following wind [kn]
7    W         Head wind [kn]
8    W         Side wind [kn]
9    T         Departure draught (fore perpendicular) [m]
10   T         Departure draught (aft perpendicular) [m]
11   T         Departure draught (midship section - port) [m]
12   T         Departure draught (midship section - starboard) [m]
13   T         Arrival draught (fore perpendicular) [m]
14   T         Arrival draught (aft perpendicular) [m]
15   T         Arrival draught (midship section - port) [m]
16   T         Arrival draught (midship section - starboard) [m]
17   σ²        SOG variance [kn²]
18   Trim      Departure trim [m]
19   Trim      Arrival trim [m]
20   Δ         Displacement [Mt]


Table 2: Summary of the comparison framework with the hyper-parameter value(s) considered for
model training. For each of the 40 iterations, suitable values are selected by 10-fold cross-validation.

Class                 Method   Hyper-parameter  Hyper-parameter value(s)
Variable selection    FSR      p_in             0.05
                               p_out            0.1
Penalized regression  RR       γ                0.002; 0.02; 0.2; 2; 20
                      LASSO    γ                0.001; 0.01; 0.1; 1; 10
                      EN       α                0.001; 0.01; 0.1; 1
                               γ                0.002; 0.02; 0.2; 2; 20
                      SVR      ε                0.001; 0.005; 0.01; 0.05; 0.1
Latent variable       PCR      a_PCR            [1:min(20, n, p)]
                      PCR_FS   p_in             0.05
                               p_out            0.1
                      PLS      a_PLS            [1:min(20, n, p)]
Tree-based ensemble   BaRT     T_BaRT           50; 100; 500; 1000; 5000
                      RF       T_RF             50; 100; 500; 1000; 5000
                      BoRT     T_BoRT           50; 100; 500; 1000; 5000


3 Results and discussion


Prior to the application of the comparison procedure described above and to the
analysis of the best predictive methods, a pre-analysis of the operational data is
conducted to identify potentially predictive variables and to summarize their main
characteristics. The first goal is achieved by computing the Pearson correlation
between each predictor variable and FCPH as reported in Figure 1.a, where one can 
observe that speed over ground (variable #5) has the highest correlation coefficient 
followed by the starboard shaft generator power (variable #2) and the port generator 
power (variable #1). Thus, quantifying FCPH based only on speed over ground, as 
used for building empirical curves, is most likely the best univariate approach. 
However, by adopting multivariable methods, the information content of other 
predictors can also be used for obtaining more reliable predictions of FCPH. 
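
The screening step described above amounts to ranking the predictors by the absolute value of their Pearson correlation with the response. A minimal sketch, with hypothetical function names and invented toy values rather than the ship data, could look as follows.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))


def rank_predictors(predictors, response):
    """Sort predictor names by decreasing absolute correlation with the response."""
    scores = {name: pearson(values, response)
              for name, values in predictors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values (invented for illustration only).
fcph = [1.0, 2.0, 3.0, 4.0, 5.0]
predictors = {
    "head_wind": [3.0, 1.0, 4.0, 1.0, 5.0],
    "speed_over_ground": [10.0, 12.0, 14.0, 16.0, 18.0],
}
ranking = rank_predictors(predictors, fcph)
print(ranking)
```

Such a ranking only captures pairwise linear association, which is why it serves as a pre-analysis and not as a substitute for the multivariable models.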

In order to perform an exploratory analysis, PCA was applied to the set of 
predictor variables. The first and second principal components are presented in 
Figure 1.b. Analyzing Figure 1.b, one can notice the existence of two clusters, which 
correspond to different levels of the port and starboard shaft generator power (variables #1
and #2, respectively). These clusters occur because the generator is turned off in
some voyages, and they suggest that the regression methods might be applied separately to
each cluster. However, as will be shown shortly, variables #1 and #2 were not
important for the regression models and no clusters were observed for the response
variable. To assess the prediction performance of the various regression methods
included in the comparison study, Figure 2 presents the distribution of RMSE_dcv
obtained over 40 iterations of double cross-validation.

Figure 1. Assessing variable importance and samples’ distribution: (a) the Pearson correlation
coefficient between each predictor and FCPH and (b) the first two principal components from PCA.

Figure 2. Distribution of RMSE_dcv in 40 iterations of double cross-validation.

Analyzing Figure 2, one can note that most regression methods present similar
prediction errors (the median RMSE_dcv is close to 0.22) and only PCR_FS and RF are
poor choices, since their prediction errors are higher than those of the other methods. In other
words, the general conclusion is that all methods predict FCPH equally well,
except PCR_FS and RF, and the choice should fall on the simplest method, that is, the
method with the smallest number of model parameters. Thus, the recommended
method is LASSO, as it tends to discard irrelevant variables and produce a sparse
structure in the regression coefficients. In order to identify important variables,
Figure 3.a presents the regression coefficients obtained for LASSO in the 40
iterations of double cross-validation. One can observe that speed over ground
(variable #5) is one of the most important variables, as expected from its high
correlation with FCPH. Furthermore, arrival draught (variable #16) has, in most
iterations, the same weight as speed over ground, and arrival trim (variable #19)
consistently contributes to the model.

In terms of irrelevant predictors, variables #1-4 have regression coefficients very
close to zero and can be discarded from the model. Lastly, Figure 3.b further
corroborates the validity of the LASSO models by comparing the predicted and
measured FCPH over all 40 iterations of double cross-validation. Figure 3.b shows
good agreement between predicted and measured values: the values fall close to the
identity line and no clusters are observed. Furthermore, the median coefficient of
determination over all iterations is 0.88, corroborating the ability of the developed
models to predict FCPH.

Figure 3. Results obtained for LASSO during model training: (a) the regression coefficients and (b) the
predicted and measured FCPH.
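
For reference, the coefficient of determination summarized above can be computed per iteration and then reduced to its median, as in the following sketch. The function name and the three toy iterations are invented for illustration and do not reproduce the reported value of 0.88.

```python
from statistics import median


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Each pair holds the measured and predicted response of one iteration (toy values).
iterations = [
    ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]),
    ([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5]),
    ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),
]
r2_scores = [r_squared(measured, predicted) for measured, predicted in iterations]
print(median(r2_scores))
```

Reporting the median rather than a single value keeps the summary robust to an unusually easy or difficult test split.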


4 Conclusions 


In this work, we assessed the potential for using operational data collected from a 
Ro-Pax ship to predict its fuel consumption per hour (FCPH). The collected dataset
covers a period of one year and contains 20 predictor variables. In order to build
regression models, four classes of regression methods were considered: variable 
selection, penalized regression, latent variable and tree-based ensembles. Within 
each class, representative methods were selected in order to account for a wide range 
of a priori assumptions regarding the distribution of predictors, response variable 
and the relation between them. 

The regression methods were compared using 40 iterations of double cross-
validation. In each iteration, the root mean squared error (RMSE_dcv) was computed.
The distribution of RMSE_dcv allows the estimation of the variability in the methods’
performance, resulting in a more robust comparison.

The application of the regression methods to the collected dataset revealed that
the choice of the regression method is not particularly important, since most methods,
except PCR_FS and RF, presented a similar distribution of RMSE_dcv. Nevertheless,
the recommended method was LASSO as it often eliminates irrelevant variables by 
the application of a suitable penalty to the magnitude of regression coefficients. 
Furthermore, the good agreement observed between predicted and measured FCPH 
was also confirmed by the median coefficient of determination over the 40 iterations
of double cross-validation (0.88).


Twitter as a Statistical Data Source: an Attempt
of Profiling Italian Users’ Background 
Characteristics 


Twitter come fonte di dati: un tentativo di individuare le 
caratteristiche di background degli utilizzatori italiani 


Righi Alessandra and Gentile Mauro Mario 


Abstract Social media (SM) are becoming an important source of data on the
opinions and sentiment of their users because they make it possible to capture, in
real time and in an unsolicited way, what users think about a certain topic. In Italy,
Twitter appears to be one of the most used media; it is more accessible and lends
itself more readily to text analysis. This paper presents an attempt at profiling Italian
twitterers (for research purposes only), carried out at national level using a REST
API downloading method. This knowledge would make it possible to better assess the
representativeness of users and, consequently, to correct for the strong selectivity of
SM users. The technological/statistical approach used and the main results are
presented.

Abstract I social media (SM) stanno diventando una fonte di dati importante circa 
le opinioni e il sentiment dei loro utenti perché consentono di acquisire in tempo 
reale e in modo non sollecitato ciò che gli utenti pensano su un determinato 
argomento. In Italia Twitter sembra essere uno dei più utilizzati, ha una maggiore 
accessibilità e permette agevolmente l’analisi dei testi. Questo articolo presenta un 
tentativo di profilazione dei twitterers italiani (solo per scopi di ricerca) svolto a 
livello nazionale utilizzando un metodo di scaricamento REST API. La conoscenza 
di queste informazioni permetterebbe di identificare meglio la rappresentatività 
degli utilizzatori e, di conseguenza, di correggere la forte selettività degli utenti di 
Twitter. L'approccio tecnologico / statistico utilizzato e i principali risultati vengono 
presentati.


1 Introduction


In recent years, Social media (SM) have become an important source of data on the
opinions and sentiment of their users because they make it possible to capture, in
real time and in an unsolicited way, what users think about a certain topic. In Italy,
Facebook and Twitter appear to be the most used media, and the latter is more
accessible and lends itself more readily to text analysis (Della Ratta et al, 2016).

Twitter is a microblogging service which lets users post tweets of up to 140 characters
about anything. Created in 2006, the worldwide service today averages, according to
the Statista website, 319 million monthly active users and, according to the Alexa
website, Twitter is the ninth most visited site in the world.

At national level, some estimates, such as the Total Digital Audience provided by
Audiweb (a private impartial entity monitoring Internet audience data in Italy)
and Nielsen, put the number of “active” Italian Twitter users at more than 6.4 million.
However, before Social media can be used as a real statistical source, some challenges
must be faced. They concern the representativeness and the time
stability of the source and the need to know who the users are in terms of socio-
economic characteristics. This knowledge would make it possible to correct for the
strong selectivity of Social media users (Daas et al, 2016).

Unfortunately, official information about who the users are at national level is not
available but, as tweets are associated with metadata, some information on the
background characteristics of the users is publicly available.

This paper presents an attempt at profiling Italian twitterers (for research
purposes only), carried out at national level. The Twitter data and the
technological and statistical approach used are described in Section 2; the main results
regarding the background characteristics (particularly gender, location and
active/non-active status) of the users who write their posts in Italian are in
Section 3.


2 Data and Methods 


Obtaining auxiliary information on units in Twitter is challenging, especially
because the data are becoming widely available to researchers to predict financial
tangibles as well as intangible assets, such as reputation (Bollen et al, 2011).

A method called ‘profiling’ is an interesting option for doing this (Daas et al, 2016). We
consider only data publicly available on Twitter and we use a REST API¹ downloading
method to get the information.


¹ Twitter exposes its data via an Application Programming Interface (API). The REST APIs provide
programmatic access to read and write Twitter data and return responses in JSON format. They
make it possible to perform historical search queries on recently posted tweets and to retrieve lists of
users, followers, etc. Access to the API is associated with the personal Twitter account of the developer.
Users have to request a personal key for accessing the API. Moreover, the REST API imposes limits on the
number of requests that can be issued by the same user and on the number of tweets that are returned.

We used information derived from Socialbakers.com, a worldwide portal providing
real-time statistics on the twitterers with the largest audience in each country (overall
and by specific topic, e.g. sport, entertainment, etc.). We started by
downloading the profiles of the followers of the ten most popular Italian
Twitter accounts, and then continued with those of the top accounts of each category in
decreasing order until finding new profiles became increasingly
difficult. In addition, for each downloaded user, we downloaded the most
recent tweets to evaluate the active status.

The downloaded data are thus divided into two logical groups. On the one hand, we
have some attributes supplied by the user, such as the user’s name, nickname, a biographic
description and the number of users that respectively follow and are followed by the
user (giving an indication of the popularity of the user). On the other hand, we have the
full text of the tweets and some other information, such as the time and date of creation
of the tweet and the place from which the tweet was posted.

The user’s name does not necessarily display a person’s name; on the contrary, both
the user’s name and the nickname can be filled with fancy names or brands. There
are no specific variables indicating gender, age, or status of the user.

As for the search strategy, scripts have been developed in Python, making extensive
use of the Tweepy library.


3 Main results


Using the developed software, we obtained a sample of 11 million unique Twitter
usernames of Italian-speaking users (the work is still in progress).

Our first goal with these data was to set the active/non-active status of the users,
defining as “active” someone who has posted at least one tweet in the last four months.
For this processing we used a subset of the first 3.7 million user profiles downloaded,
and we found that a very high proportion of users have never posted a tweet (around
1/3 of the sample) and only 30% of the people posting tweets have sent at least one tweet
in the last four months. Nevertheless, it was impossible to determine the proportion of
users who just passively follow the exchange of tweets among other users, a mode of use
which seems to be fairly widespread.

The second goal was to try to define the user’s gender, as no direct information is
available in the user’s profile. Thus, we tried to infer it from other information supplied
by the user, mainly from the name, the nickname and the bio description (unfortunately,
available for only one out of four users).

In the first attempt we compared the users’ names with a wide list of the most frequent
male/female Italian names to set the gender. Due to the presence in
the sample of many foreigners living in Italy and writing tweets in Italian, we added
a subset of foreign names to our list (around 1,500 names). In this way, we could
assign a gender to around 68% of the users in the sample, calculating a share of 56%
men and 44% women among the twitterers. We also found 8% of ambiguous cases
and 23% of unclassifiable users.
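
The two rule-based steps described above (active status and name-based gender assignment) can be sketched as follows. This is an illustration under stated assumptions, not the scripts used in the study: the name lists are tiny hypothetical stand-ins for the real lists, four months is approximated by 120 days, and the function names are invented.

```python
from datetime import date, timedelta

# Hypothetical miniature name lists; the study used much larger ones.
MALE_NAMES = {"marco", "luca", "andrea"}
FEMALE_NAMES = {"giulia", "anna", "andrea"}  # "andrea" is female in some countries


def is_active(last_tweet, reference, window_days=120):
    """Active = at least one tweet in roughly the last four months.
    last_tweet is None for users who have never posted."""
    if last_tweet is None:
        return False
    return reference - last_tweet <= timedelta(days=window_days)


def gender_from_name(display_name):
    """Classify a display name as male, female, ambiguous or unclassified."""
    tokens = [token.strip(".,_-").lower() for token in display_name.split()]
    male = any(token in MALE_NAMES for token in tokens)
    female = any(token in FEMALE_NAMES for token in tokens)
    if male and female:
        return "ambiguous"
    if male:
        return "male"
    if female:
        return "female"
    return "unclassified"


reference_day = date(2017, 3, 1)
print(is_active(date(2017, 1, 10), reference_day))
print(is_active(None, reference_day))
for name in ("Giulia Rossi", "Marco", "Andrea Bianchi", "Pizza Lovers"):
    print(name, "->", gender_from_name(name))
```

Names that appear in both lists, or profiles that combine a male and a female name, fall into the ambiguous group, while brands and fancy names end up unclassified.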

To complete the picture, a machine learning (ML) algorithm aimed at
assigning a gender to the unclassified/ambiguous cases was developed, using the
contents of two profile fields, “name” and “bio description”. After
normalizing the texts (with text mining techniques such as tokenization, elimination of
stop words, punctuation, etc.), we weighted each term occurrence through an information
retrieval technique called term frequency-inverse document frequency (tf-idf) to
give more relevance to the terms carrying the most discriminatory information and
then applied a logistic regression classifier. The algorithm was
tuned through the Python GridSearchCV object to find the optimal set of parameters for the
logistic regression model using 5-fold stratified cross-validation. We used the n-gram
range (from unigram to trigram), the applied regularization (none, L1, L2) and
the regularization strength C as tuning parameters.
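
The text normalization and tf-idf weighting can be illustrated with a short, self-contained sketch. It uses the textbook weight (term count times the logarithm of the number of documents over the document frequency), which is an assumption: library implementations often apply smoothed variants. The stop word list, the function names and the example biographies are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"di", "e", "la", "il"}  # hypothetical, deliberately tiny


def tokenize(text):
    """Lowercase, replace punctuation with spaces and drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [token for token in cleaned.split() if token not in STOP_WORDS]


def ngrams(tokens, max_n=3):
    """All n-grams from unigrams up to max_n-grams, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(documents, max_n=3):
    """Return one {term: weight} dictionary per document."""
    term_lists = [ngrams(tokenize(doc), max_n) for doc in documents]
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(documents)
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights


# Invented example biographies.
bios = ["Mamma di Luca e Sara", "Studente di ingegneria", "Mamma e blogger"]
weights = tfidf(bios)
print(sorted(weights[0].items(), key=lambda item: -item[1]))
```

Terms that occur in few profiles receive larger weights than terms shared by many profiles. In a scikit-learn workflow, the weighted terms would then feed a logistic regression whose settings are chosen by GridSearchCV.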

We divided the sample into two subsets: the training set (with information relating to
70% of the users for whom we determined the gender) and the test set (with the
remaining 30% of users). The best model found through GridSearchCV was applied
to the test set and led to a 75% accuracy score for our general imputation. Using
this ML process we were able to predict the gender for part of the subset of
unclassified/ambiguous cases, but some unclassifiable cases still remain (21% of
the sample). These cases should be investigated further to understand whether this
subset is composed of brands, associations or something else.

This second imputation process slightly reduced the share of
males in the sample, to 55%.

We also evaluated whether the self-declared professional status in the profile’s bio
description can be used to define the user’s occupation.
Unfortunately, the bio description item (containing information on the user's activity)
is filled in only 25% of cases, even though some information could be erroneously
contained in other profile items (e.g., location, screen name). In an even smaller
percentage of cases the self-declared professional status is reported in the text of the
tweets in a meaningful way. The most recurring professional conditions are students,
journalists, architects, photographers, and even managers. It seems, however,
difficult to obtain statistical information on a large scale in this way.

The “location” of the user in the Twitter profile is also filled in a limited number of
cases (24%), and the text mostly expresses the concept of domicile / place of activity
in a playful way (e.g., “a place in the world” or “around the world”) or in an
unspecified way. In order to properly use these data (expressed as names of
geographical places or even as geographic coordinates), an ML approach is needed
to search for geographical terms in the various profile items. Moreover, the
user’s location might be inferred from the tweets. However, only 15% of
downloaded users allow the geolocation of their tweets.


4 Conclusions


As Social media access may cover a selective part of the target population,
auxiliary information explaining the missingness of some sub-populations should be
used to quantify and correct for selectivity. This work showed that, among the
commonly used auxiliary variables, only the gender of the user can be determined
from the user’s name and the short biography for the entire sample with an
acceptable degree of reliability. The result is that males are overrepresented among
Twitter users. However, distortions due to the download method
can occur: since we start from the followers of the most popular Twitter accounts, the
profiles of those who are more “isolated” in the social network could be reached
with greater difficulty or not reached at all.

This work is still in progress in order to complete the sample of Twitter users, but it
seems to be promising. Among the open issues for future work is the extension
of the search for the professional status and the location to other profile fields
where this information could be unexpectedly found (due to errors in data entry),
or to the texts of the tweets. Moreover, a clear detection of the number of company
accounts is also needed.


Quality issues when using Big Data in Official
Statistics 


Aspetti di qualità statistica quando si usano i Big Data
nella Statistica Ufficiale


Paolo Righi, Giulio Barcaroli, Natalia Golini 


Abstract The use of Big Data (BD) for improving statistics and reducing costs
is a great opportunity and challenge for the National Statistical Offices (NSOs). Often
the debate on BD is focused on the IT issues related to their volume, velocity and
variety. Nevertheless, the NSOs must also be assured that the estimates have a good
level of accuracy. This paper evaluates when estimators using variables
scraped from a list of enterprise websites, which suffer from selectivity
concerns, are competitive with survey sampling estimators. A Monte
Carlo simulation using a synthetic population based on real data is implemented to
compare predictive estimators based on BD, survey estimators and blended estimators
combining predictive and survey estimators.


Key words: Big Data, sampling estimation, selectivity, Big Data quality framework 


1. Introduction 


The opportunity to produce enhanced statistics, together with declining budgets, makes
the use of Big Data (BD) appealing to National Statistical Offices (NSOs). Often the
debate on these sources focuses on volume, velocity and variety, and on the IT capability
to capture, store, process and analyze BD for statistical production. Nevertheless,
other features have to be taken into account, especially in the NSOs, such as veracity
(data quality in terms of selectivity and trustworthiness of the information) and
validity (data correct and accurate for the intended use). Veracity and validity affect
the accuracy (bias and variance) of the estimates and therefore raise the question of
whether a large amount of data necessarily produces high quality statistics. This paper
evaluates when estimators that use the Internet as BD source, and suffer from
selectivity, are competitive with a survey sampling estimator. Design based estimators
[2,3] and supervised model based estimators [5] using scraped data are compared
(Section 2). A simulation study based on real data from the 2016 Istat “Survey on ICT
usage and e-Commerce in Enterprises” (ICT survey) has been carried out. A synthetic
population of enterprises with websites has been built (Section 3.1). The target
variable and the variables scraped from the websites have been generated according to
the distributions observed in the ICT survey. Section 3.2 describes the set-up of the
simulation. The performance of the estimators is shown in terms of bias, variance and
mean square error (Section 3.3). Section 4 is devoted to short conclusions.


2. Notation and sampling strategy 


Let $U$ be the reference population of $N$ elements and let $U_d$ ($d = 1, \dots, D$) be an
estimation domain, where the $U_d$'s partition $U$. $U_d$ is a sub-population of $U$ with $N_d$
elements, for which separate estimates are calculated. Let $y_k$ denote the value of the
variable of interest attached to the $k$-th population unit ($k = 1, \dots, N$). The
parameters to be estimated are $Y_d = \sum_{k \in U_d} y_k$ and $Y = \sum_{k \in U} y_k$.

For defining the estimation procedure let us introduce a further partition of $U$. Let $U^v$
($v = 1, \dots, V$) be a sub-population of size $N^v$ that is distinguished by its set of
auxiliary information, for instance a sub-population in which auxiliary variables from a
BD source are available. Let $x_k^v$ be the auxiliary variable vector from the BD source and
$z_k^v$ be the auxiliary variable vector known from the frame list for unit $k$. For simplicity,
$z_k^v = z_k$ $\forall v$, $V = 2$, and if $k \in U^1$ the vector $(x_k^1, z_k)$ is known, while for $v = 2$ only
$z_k$ is known. Then the totals $Z_d = \sum_{k \in U_d} z_k$ are known. The $U^v$'s cut across the $U_d$'s,
so that $U_d^v = U_d \cap U^v$. We assume that the totals $Z_d^v = \sum_{k \in U_d^v} z_k$ are known.

In the sampling strategy, $y_k$ is observed in a random sample $s$ of size $n$. The sample
could be affected by non-response. Let $r$ be the number of respondents in $s$ and let $r_d$
and $r^v$ be, respectively, the number of respondents belonging to $U_d$ and $U^v$. In the
observed sample, we can estimate a model $\hat{y}_k = f(x_k, z_k)$ for predicting the $y$
variable. Table 1 introduces the estimators $\hat{Y}$ of $Y$ that are compared in the simulation.
The derivations of the estimators $\hat{Y}_d$ of $Y_d$ are straightforward.


The list of estimators is not exhaustive but broadly maps the possible estimators.
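Table 1: General description of the estimators used in the simulation.


Estimator  Description                                                      Note
           Expression

Mod1       $b_k = N/(N^1 + r - r^1)$                                        Model based est.
           $\hat{Y} = \sum_{k \in r^1} y_k b_k + \sum_{k \in U^1 - r^1} \hat{y}_k b_k + \sum_{k \in r - r^1} y_k b_k$

Mod2       $w_k$: calibration [3] of the $b_k$'s defined in Mod1, with      Calibrated model based est.
           the weighted totals of $z_k$ equal to $Z_d$ $\forall d$
           $\hat{Y} = \sum_{k \in r^1} y_k w_k + \sum_{k \in U^1 - r^1} \hat{y}_k w_k + \sum_{k \in r - r^1} y_k w_k$

Des1       $b_k$ is the basic sampling weight corrected for non-response    Horvitz-Thompson est.
           $\hat{Y} = \sum_{k \in r} y_k b_k$

Des2       $w_k$: calibration [3] of the $b_k$'s defined in Des1, with      Calibration est.
           the weighted totals of $z_k$ equal to $Z_d$ $\forall d$
           $\hat{Y} = \sum_{k \in r} y_k w_k$

Comb1      $b_k$ is the basic sampling weight                               Combined est. Mod1 and Des1
           $\hat{Y} = \sum_{k \in U^1 - r^1} \hat{y}_k + \sum_{k \in r^1} y_k + \sum_{k \in r - r^1} y_k b_k$

Comb2      $w_k$: calibration [3] of the $b_k$'s defined in Des1, with      Combined est. Mod1 and Des2
           $\sum_{k \in r - r^1} z_k w_k = Z_d^2$ $\forall d$
           $\hat{Y} = \sum_{k \in U^1 - r^1} \hat{y}_k + \sum_{k \in r^1} y_k + \sum_{k \in r - r^1} y_k w_k$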


3. Simulation study 


The accuracy of statistical estimates is traditionally decomposed into bias (systematic
error) and variance (random error) components. While the variance can be estimated, the
bias is not observable if the parameter of interest is unknown.
We studied the accuracy of a set of estimators via Monte Carlo simulation. A synthetic
population based on the 2016 ICT survey data has been created. The estimators taken
into account can be distinguished with respect to:
a. the origin of the auxiliary information exploited, coming from the frame list,
from a BD source or both;
b. the inferential approach (design based, model based or a combination of
both).


3.1 Target population 


We consider the set of Italian enterprises with 10 to 249 employed persons operating in
manufacturing, electricity, gas and steam, water supply, sewerage and waste
management, construction and non-financial services (about 180,000 units). The
population and a $z$ vector of auxiliary variables (location, unit size and economic
activity) are identified by the Italian Business Register (BR).

Currently, Istat uses this register as the frame list for drawing the yearly ICT survey.
The frame list (BR) is updated with information referring to two years before the survey
reference time. Among the target estimates of the ICT survey there are a number of
characteristics related to the functionalities of the websites: for instance the presence
of online ordering (e-commerce) or job application facilities. The simulation focuses
on a single binary variable, e-commerce, denoted as the $y$ variable, with $y_k = 1$ if
unit $k$ does e-commerce and $y_k = 0$ otherwise. The target parameters are the counts of
$y_k = 1$ at domain level (type of economic activity by size class of employed
persons), $Y_d$ ($d = 1, \dots, 16$), and at total level, $Y$. In particular, the types of economic
activity are denoted as M1, M2, M3 and M4 and the size classes of employees are
denoted as cl1 (small), cl2, cl3 and cl4 (large). Since the survey estimates show that
about 30% of BR units have no website, we exclude these units from the analysis and the
remaining units define the target population $U$. The discarded units follow the
distribution observed in the 2016 ICT survey in the 16 domains. We note that in
practice the size of $U$ should be treated as random. The $y$ variable is unknown in $U$,
so we create the probability $p(y_k = 1)$ for each unit by means of a logistic model,
$\mathrm{logit}\, p(y_k = 1) = \alpha + \beta' z_k$ (hereinafter denoted as the true model), where $\alpha$ and
$\beta' = (\beta_1, \dots, \beta_d, \dots, \beta_{16})$ are known regression coefficients and
$z_k = (z_{1k}, \dots, z_{dk}, \dots, z_{16k})'$, with $z_{dk} = 1$ if $k \in U_d$ and $z_{dk} = 0$ otherwise. We fix
$\alpha$ and $\beta$ such that the sum over the $U_d$'s of $p(y_k = 1)$ reflects the distribution
observed in the 2016 Istat ICT survey (Table 2, column $p$).
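
As an illustration of the true model, the following minimal sketch generates $y$ for four of the 16 domains. The domain sizes and probabilities are taken from Table 2 (column $U$ and column $p$, rows M1 cl1 to M1 cl4); the parametrisation of `alpha` and `beta`, the seed and all variable names are assumptions made only for this example, not the authors' implementation.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

# Table 2, rows M1 cl1 ... M1 cl4: domain sizes (column U) and probabilities (column p)
sizes = np.array([45949, 6240, 3613, 1732])
p_target = np.array([0.140, 0.120, 0.170, 0.261])

# Assumed parametrisation: alpha is the logit of the size-weighted overall
# probability and beta_d is the deviation of domain d on the logit scale
alpha = logit(np.average(p_target, weights=sizes))
beta = logit(p_target) - alpha

rng = np.random.default_rng(2016)                 # arbitrary seed
domain = np.repeat(np.arange(len(sizes)), sizes)  # domain index of each unit k
p_k = expit(alpha + beta[domain])                 # p(y_k = 1) under the true model
y = rng.binomial(1, p_k)                          # synthetic target variable
```

With one dummy per domain the model is saturated, so `expit(alpha + beta)` reproduces the target probabilities exactly; only the realised frequencies of `y` vary with the seed.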
The population $U$ is partitioned into 3 sub-populations, $W^1$, $W^2$ and $W^3$:
• $W^1$, the enterprises with website address (URL) available;
• $W^2$, the enterprises with a wrong URL or a website not allowing automatic
scraping;
• $W^3$, the enterprises having a website whose URL is not available.
We generated the distribution in the 3 sub-populations on the basis of the following
evidence:
• Istat has a second list of business units in which the website address (URL)
is available. Inclusion in the URL-list is on a voluntary basis and the list does
not cover the whole business register (101,000 enterprises, $W^1 \cup W^2$);
• in a concrete application of the automatic web scraping procedure, 68,676
websites were investigated ($W^1$) and 32,320 were not ($W^2$).
We assume that the URL-list suffers from selectivity, that is, the distribution of the
target variable within the URL-list ($W^1 \cup W^2$) differs from the distribution among the
units outside the list, $W^3$. This reflects the hypothesis that if an enterprise actively
uses its website for business (for instance doing e-commerce) then it has an interest in
increasing its reachability, and therefore its probability of being in the URL-list.
Table 2 shows the sizes and the expected $p(y_k = 1)$ for the 3 sub-populations.
The simulation works with $U^1 = W^1$ and $U^2 = W^2 \cup W^3$.
To complete the synthetic population we generate the output of the web scraping,
so that the Internet is the BD source of the simulation.
The automatic scraping is not able to observe the variable $y$; instead, it collects all
the texts from the websites and, in a second step, text mining and natural language
processing techniques are used to detect relevant terms that play the role of
predictors (for instance: “add to cart”, “credit card”, “order”, etc.) [1]. We assume
that, at the end of the process, we observe 12 binary variables (presence/absence),
denoted by the $x$ vector.
Table 2: Population size by domain and by $W^1$, $W^2$ and $W^3$, and the related probability of doing e-commerce


                  Population size                 Expected probability of e-commerce
Domain     $W^1$    $W^2$    $W^3$     $U$      $p^1$   $p^2$   $p^3$   $p$

M1 cl1    23,519   10,995   11,435   45,949    0.170   0.170   0.048   0.140
M1 cl2     3,146    1,499    1,595    6,240    0.154   0.154   0.023   0.120
M1 cl3     1,873      887      853    3,613    0.218   0.218   0.014   0.170
M1 cl4       922      440      370    1,732    0.333   0.333   0.000   0.261
M2 cl1     1,122      565      578    2,265    0.138   0.138   0.037   0.110
M2 cl2       237       97       82      416    0.124   0.124   0.027   0.110
M2 cl3       146       71       84      301    0.151   0.151   0.009   0.110
M2 cl4       120       53       44      217    0.222   0.222   0.000   0.181
M3 cl1     5,408    2,486    2,992   10,886    0.050   0.050   0.013   0.040
M3 cl2       382      176      206      764    0.026   0.026   0.004   0.020
M3 cl3       168       78       81      327    0.039   0.039   0.002   0.030
M3 cl4        65       27       27      119    0.025   0.025   0.000   0.020
M4 cl1    26,525   12,574   11,289   50,388    0.319   0.319   0.103   0.270
M4 cl2     2,430    1,144      890    4,464    0.379   0.379   0.081   0.320
M4 cl3     1,527      712      507    2,746    0.396   0.396   0.036   0.330
M4 cl4     1,086      516      371    1,973    0.396   0.396   0.000   0.321
Total     68,676   32,320   31,404  132,400    0.235   0.235   0.061   0.194


We underline that in practical applications the number of scraped variables can be
much larger than 12. Nevertheless, a larger set of variables would only complicate the
simulation without adding information. “Good” estimates are achieved when the target
variable and the set of auxiliary variables (large or small) have a strong relationship:
this results in high values of the model performance indicators.

We generate the 12 auxiliary variables according to two scenarios:


1- weak dependence on the target variable (harmonic mean of the precision and
recall indicators equal to 63%);

2- strong dependence on the target variable (harmonic mean of the precision and
recall indicators equal to 96%).
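
The harmonic mean of precision and recall is commonly called the F1 score. As a minimal sketch of how such an indicator is computed from a classification table, the function below uses hypothetical counts of true positives, false positives and false negatives; the counts are chosen only so that the result matches the 63% of the first scenario and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted e-commerce sites that are correct
    recall = tp / (tp + fn)      # share of true e-commerce sites that are detected
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = recall = 0.63, hence F1 = 0.63
f1_weak = f1_score(tp=63, fp=37, fn=37)
```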


In particular, the first scenario seems closest to the evidence observed in the real
2016 ICT data. Scenario 2 serves as a benchmark in the evaluation.


3.2 The simulation process 


The simulation implements a feasible and reasonable estimation process. We consider
a supervised approach, in which the target variable is observed in a sample, for
instance the ICT sample. We assume a stratified simple random sampling design
with four strata defined by the size classes, cl1, ..., cl4. The sample, of size
n = 23,229, is allocated with 16,307 units to cl1, 1,820 units to cl2, 1,061 units to cl3
and 4,041 units to cl4. The largest inclusion probabilities are assigned to the large
enterprises in terms of employees, reflecting the real sampling allocation. We generate
unit non-response, assuming a homogeneous response probability within each stratum
(0.45 for cl1, 0.88 for cl2, 0.95 for cl3 and 0.97 for cl4). The sample of respondents,
r, has an expected size of about 13,800 units (as in the 2016 ICT survey).
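
The expected number of respondents follows directly from the allocation and the stratum response probabilities given above. A short check, with variable names chosen only for this example:

```python
# Sample allocation and response probabilities by size class, as stated in the text
allocation = {"cl1": 16307, "cl2": 1820, "cl3": 1061, "cl4": 4041}
response_prob = {"cl1": 0.45, "cl2": 0.88, "cl3": 0.95, "cl4": 0.97}

# Expected respondents per stratum: n_h times the stratum response probability
expected_respondents = {h: allocation[h] * response_prob[h] for h in allocation}

n = sum(allocation.values())
expected_r = sum(expected_respondents.values())
print(n, round(expected_r))  # 23229 13867
```

The result, about 13,867 units, agrees with the "about 13,800" quoted in the text.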

At domain level the sample size is not planned. We have three domain types: Large
(L), Small (S) and Very Small (VS) (see Table 3).


Table 3: Expected size and e-commerce frequency in the observed sample


Domain         Size   e-commerce   Type
M1 cl1     3,074.09       430.45   L
M1 cl2       845.37       101.37   L
M1 cl3       520.21        88.42   L
M1 cl4     1,681.42       438.14   L
M2 cl1       151.53        16.63   S
M2 cl2        56.36         6.19   VS
M2 cl3        43.34         4.78   VS
M2 cl4       210.66        38.07   L
M3 cl1       728.30        29.16   S
M3 cl2       103.50         2.06   VS
M3 cl3        47.08         1.43   VS
M3 cl4       115.53         2.34   VS
M4 cl1     3,371.07       910.32   L
M4 cl2       604.77       193.47   L
M4 cl3       395.37       130.51   L
M4 cl4     1,914.43       613.66   L
Total     13,863.04     3,007.00   Total


The estimation process follows these steps:
1. Collect the $y$ variable for the respondent units with a website;
2. Perform the web scraping for the units in $U^1$ and collect the $x$ variables;
3. Model $y$ on $x$ in $r^1$;
4. Produce the estimate according to a given estimator.
For the estimators Des1 and Des2 (Table 1), steps 2 and 3 are skipped.
The simulation compares the 6 estimators of Table 1. We note that:
• Mod1, Mod2, Comb1 and Comb2: $\hat{y}_k = \hat{P}(y_k = 1)$ is predicted with a
working logistic model using the $x$ variables;
• Des1: uses an incorrect MCAR [4] model for the non-response weight
adjustment;
• Des2: calibration performs a correct weight adjustment for non-response;
• Comb1, Comb2: produce estimates for $U^1$ (using Mod1) and $U^2$ (using
Des1 or Des2);
• Comb2: calibration performs a correct weight adjustment for non-response
in $U^2$.
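
The four steps can be sketched on a toy population as follows. This is a minimal illustration under strong simplifying assumptions, all of them hypothetical: a simple random sample instead of the stratified design, MCAR non-response, a single scraped binary predictor instead of 12, and a working logistic model with one binary covariate, which is saturated and therefore reduces to the respondent proportions of $y$ within each value of $x$. The combined estimate below is a simplified analogue of Comb1, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)                       # arbitrary seed

# Toy synthetic population (all values hypothetical)
N1, N2 = 6000, 4000                                  # sizes of U^1 (scraped) and U^2
N = N1 + N2
in_u1 = np.r_[np.ones(N1, bool), np.zeros(N2, bool)]
y = rng.binomial(1, np.where(in_u1, 0.25, 0.06))     # selectivity: more e-commerce in U^1
x = rng.binomial(1, np.where(y == 1, 0.8, 0.1))      # scraped predictor, used only in U^1

# Step 1: sample, non-response, y observed for respondents
n, resp_rate = 2000, 0.6
sample = rng.choice(N, size=n, replace=False)
resp = sample[rng.random(n) < resp_rate]
r1 = resp[in_u1[resp]]                               # respondents in U^1
r2 = resp[~in_u1[resp]]                              # respondents in U^2

# Steps 2-3: working model of y on x fitted on r^1 (cell proportions)
p_hat = np.array([y[r1][x[r1] == v].mean() for v in (0, 1)])

# Step 4: predictive total for U^1 (observed y in r^1, predictions elsewhere)
not_r1 = in_u1.copy()
not_r1[r1] = False
Y1_hat = y[r1].sum() + p_hat[x[not_r1]].sum()

# Design based part for U^2, with weight N / r (sampling weight times MCAR adjustment)
b = N / len(resp)
comb_hat = Y1_hat + b * y[r2].sum()                  # combined estimate of Y
des_hat = b * y[resp].sum()                          # purely design based estimate of Y
```

In this toy setting non-response really is MCAR, so both estimates are close to `y.sum()`; the simulation in the paper instead makes the response probability depend on the size class, which is what biases the Des1-type weights.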


3.3 Results 


The simulation takes into account the methodological frameworks of the respective
estimators. For the model based estimators the $y$ variable is treated as random: once
the sample has been selected, the $y$ values change over the iterations. For the design
based estimators the $y$ values are fixed, and in each iteration a new random sample is
selected. The simulation implements 1,000 iterations and computes for each iteration
the estimates $\hat{Y}_{jd}^{i}$ for the $j$-th estimator, the $d$-th domain and the $i$-th iteration. The
following statistics are considered for Mod1, Mod2, Des1 and Des2: the coefficient of
variation (CV), the relative bias (RBIAS) and the relative root mean square error
(RRMSE). In these statistics, $\hat{Y}_{j}^{i}$ denotes the $j$-th estimator in the $i$-th iteration of
$Y = \sum_{k \in U} y_k$. Table 4a shows that the
model based estimators produce biased estimates for all the domain types. These
results indicate that a predictive model estimated on a sample representing a
specific population ($W^1$) does not fit the other populations (such as
$W^3$). Calibration in the Mod2 estimator partially corrects the bias. The discrepancies
between Scenarios 1 and 2 confirm the importance of using a good working model for
improving the accuracy (bias). Table 4b shows the two design based estimators.
Focusing on the calibration estimator (Des2), the correct weight adjustments produce
nearly unbiased estimates but high CV and RRMSE, especially for the VS and S domains.
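
The formulas of the three indicators are not legible in this text, so the sketch below assumes the usual Monte Carlo definitions, all expressed as percentages of the true value, for which RRMSE² = CV² + RBIAS². This assumption appears consistent with the Total column of Table 4b for Des1, where sqrt(1.62² + 9.36²) ≈ 9.50. The function name and the simulated inputs are hypothetical.

```python
import numpy as np

def accuracy_indicators(estimates, true_value):
    """CV, RBIAS and RRMSE (in %) of Monte Carlo estimates of a known total."""
    est = np.asarray(estimates, dtype=float)
    cv = 100.0 * est.std() / true_value                      # random error
    rbias = 100.0 * (est.mean() - true_value) / true_value   # systematic error
    rrmse = 100.0 * np.sqrt(np.mean((est - true_value) ** 2)) / true_value
    return cv, rbias, rrmse

# Hypothetical example: 1,000 iterations of an estimator with about -9% bias
rng = np.random.default_rng(1)
true_Y = 25_000.0
estimates = rng.normal(loc=0.91 * true_Y, scale=0.016 * true_Y, size=1000)
cv, rbias, rrmse = accuracy_indicators(estimates, true_Y)
```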


Table 4a: Maximum values of accuracy indicators observed in the simulation for model based estimators


                                        Domain Type
Estimator          Statistic       VS        S        L    Total
Mod1 Scenario 1    CV          112.90    10.54    25.42     1.82
                   RBIAS       629.80   313.08    74.46    28.47
                   RRMSE       632.17   313.26    77.97    28.53
Mod1 Scenario 2    CV          111.24     8.63    25.54     0.65
                   RBIAS        85.35    44.75    74.72    19.11
                   RRMSE       135.34    45.36    77.43    19.12
Mod2 Scenario 1    CV           65.47    10.51    14.75     1.83
                   RBIAS       628.42   342.75    70.70    27.72
                   RRMSE       630.26   342.91    70.82    27.78
Mod2 Scenario 2    CV           64.65     8.79    14.86     0.67
                   RBIAS        90.87    55.11    25.63    17.54
                   RRMSE        99.44    55.66    26.03    15.56


Table 4c shows the accuracy of the blended estimates, combining the model and design
based estimates. We note that Comb1 - Scenario 2 is highly competitive with respect
to the Des1 estimator. We underline that neither estimator correctly adjusts the
weights of the $r - r^1$ sampled units. Comparing Comb2 - Scenario 2 with Des2, the
first estimator seems better for the S domains and competitive for the VS, L and Total
domains.
Table 4b: Maximum values of accuracy indicators observed in the simulation for design based estimators

                                        Domain Type
Estimator          Statistic       VS        S        L    Total
Des1               CV          142.30    18.08    14.75     1.62
                   RBIAS        61.61   -25.88    62.56    -9.36
                   RRMSE       153.85    31.57    62.73     9.50
Des2               CV           89.33    23.97     9.25     1.92
                   RBIAS        -1.59    -1.68     0.39    -0.02
                   RRMSE        89.33    24.03     9.25     1.92


Table 4c: Maximum values of accuracy indicators observed in the simulation for combined estimators


                                        Domain Type
Estimator          Statistic       VS        S        L    Total
Comb1 Scenario 1   CV           83.27     9.95    12.79     1.48
                   RBIAS       391.99   156.88    41.97     1.46
                   RRMSE       399.29   157.19    42.78     2.08
Comb1 Scenario 2   CV           81.99    10.16    12.96     1.13
                   RBIAS       101.40   -18.48    32.20    -3.82
                   RRMSE       130.36    21.09    34.70     3.99
Comb2 Scenario 1   CV           81.94    12.74    12.59     1.58
                   RBIAS       368.97   165.38    25.61     5.39
                   RRMSE       373.27   165.80    26.26     5.62
Comb2 Scenario 2   CV           80.64    12.91    12.71     1.26
                   RBIAS        63.43    13.79     7.90     0.11
                   RRMSE       102.59    17.72    14.97     1.26


4. Conclusion 


Big Data represent a concrete opportunity for improving official statistics.
Nevertheless, their use has to be carefully evaluated. In this paper, we show in a
simulation that even the use of auxiliary variables from the Internet BD source that are
highly correlated with the target variable (Scenario 2) does not guarantee an
improvement in the quality of the estimates if selectivity affects the source. Analysing
the BD variables and studying the relationship between the populations covered and not
covered by the BD source is a fundamental step in knowing how to use the source and
which framework to implement to assure high quality output.
Indicators for the representativeness of survey
response as well as convenience samples




Emilia Rocco 


Abstract Non-response bias has long been a concern for surveys, even more so over
the past decades with the increasing decline of the response rates. A similar problem
concerns the surveys based on non-representative samples, the convenience and
cost-effectiveness of which has increased with the recent technological innovations
that allow for collecting large numbers of highly non-representative samples via
online surveys. These two cases may be considered jointly since in both it must be
assumed that the bias is the result of a self-selection process and, for both, quality
indicators are needed to measure the impact of this process. In this study we analyze,
in different scenarios, the combined use of two indicators that have been suggested
in the non-response context, but which may work as well for convenience samples.


Key words: response probability, self-selection bias, survey participation,
weighting-adjustment methods


Emilia Rocco 
Dipartimento di Statistica, Informatica, Applicazioni “G. Parenti”, Università degli Studi di 
Firenze, Viale Morgagni, 59 - 50134 Firenze, e-mail: emilia.rocco@unifi.it




Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 




1 Indicators of non-response bias


As response rates have declined over the past decades, the statistical benefits of
probabilistic sampling have diminished. Assuming that a representative sample is
initially selected, low response rates mean that those who ultimately supply the
target data might not be representative. Moreover, with recent technological
innovations, it is increasingly convenient and cost-effective to collect large numbers
of highly non-representative samples via online surveys.

The main problem caused by non-representative survey data is that estimators of
population characteristics must be assumed to be biased unless convincing evidence
to the contrary is provided. This problem affects in the same way the data coming from
a probability sample affected by non-response and the data obtained with a convenience
sample. Hence, in both cases, the same quality indicators may be used to evaluate the
impact of non-representativeness and the same post-survey adjustment methods may be
used to deal with it.

In recent literature, various indicators have been proposed as indirect measures of
non-response bias in surveys. Wagner (2012) provides a taxonomy of such measures
based on the types of data used to estimate each one. In more detail, he describes
three types of alternative indicators: (1) indicators involving the response indicator;
(2) indicators involving the response indicator and auxiliary data that are known for
all sample units and may stem from sampling frame data, administrative data and
data about the data collection process; (3) indicators involving the response
indicator, auxiliary data and survey data (i.e. the data for respondents). It is a
well-known finding in survey methodology that the only indicator of the first type, the
response rate, is by itself a poor indicator of non-response bias. Indicators of the
second type use auxiliary data for predicting the response indicator and provide a
single measure of the risk of non-response bias for the whole survey, relying on the
implicit assumption that the auxiliary variables used to create them are correlated
with all the survey estimates. Providing a single measure for the whole survey is a
strength of such indicators, since it allows them to be used as tools for comparing
different surveys and surveys over time, and for comparing different data collection
strategies and modes. However, it is also a weakness, because a single measure of the
risk of non-response bias for the whole survey could lead to incorrect conclusions for
those survey statistics for which the implicit assumption of correlation with the
auxiliary data used to create the risk measure is unlikely to hold. Indicators of the
third type, which, in addition to the response indicator and the auxiliary variables,
use the observed survey data, are defined at the level of the single statistic.
Since non-response bias occurs at the level of the statistic, such an indicator
directly adds information about the bias, provided that the model assumption on which
it relies is good. However, being defined at the statistic level is also a weakness of
these indicators. Given that most surveys have multiple objectives, there would be
several indicators, which makes the computation process more complex than for the
other two types of indicators and could lead to potentially different conclusions about
the data-collection strategy.

In this study, in order to measure the risk of non-response bias, we examine the
effectiveness, in different scenarios, of a prominent indicator of the second type, the
“R-indicator” suggested by Schouten et al. (2009), and suggest the combined use
of this indicator with another one of the third type, which relies on the variation of
respondent means across the percentiles of the response probabilities predicted for
estimating the R-indicator.


2 Theoretical framework and notation 


Let $U$ be a population of $N$ units ($i = 1, \dots, N$), $s$ a probability sample drawn by
employing the sampling design $p(s)$ and $r$ the set of responding units. Denote with


• $\pi_i$ the first order inclusion probability for unit $i$;
• $\delta_i$ the response indicator, so that $\delta_i = 1$ if unit $i$ responds and $\delta_i = 0$ otherwise.


We shall suppose that the target of inference is the population mean of a survey
variable taking value $y_i$ for unit $i$ and that the data available for estimation
purposes consist of the values $\{y_i; i \in r\}$ of the survey variable and the values
$\{x_i = (x_{1,i}, \dots, x_{K,i}); i \in s\}$ of a vector of auxiliary variables that may influence the
non-response mechanism and/or the survey variable. Moreover, we assume that the
response mechanism is MAR and that, given the sample, the response indicators are
independent random variables with:


$$\mathrm{pr}(\delta_i = 1 \mid i \in s, y, x) = \rho(x_i) = \rho_i \qquad (1)$$


The basic idea of the R-indicator is that a response subset is representative with
respect to $x$ when the response propensities are constant in $x$. Relying on this idea, it
measures the extent to which the response probabilities $\rho(x_i)$ vary as follows:


$$R_\rho = 1 - 2 S_\rho \qquad (2)$$


where $S_\rho$ is the standard deviation of the individual response propensities. Therefore,
it is higher when the variability among the response probabilities is lower.
In practice, the response propensities are unknown. However, when auxiliary data
are available at the sample level, it is possible to estimate them for all sampled units
and to replace $R_\rho$ with the estimator:


$$\hat{R}_\rho = 1 - 2\sqrt{\frac{1}{N-1}\sum_{i \in s}\frac{(\hat{\rho}_i - \hat{\bar{\rho}})^2}{\pi_i}} \quad \text{where} \quad \hat{\bar{\rho}} = \frac{1}{N}\sum_{i \in s}\frac{\hat{\rho}_i}{\pi_i} \qquad (3)$$


The response propensities, $\rho_i$, are commonly estimated with explicit or implicit
models linking the response occurrences to the auxiliary variables, for instance, by
using a logistic or a probit regression model, or the weighting within cell method.
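
A minimal sketch of estimator (3), with the response propensities estimated by the weighting within cell method, is given below. The function names and the toy data are hypothetical, and the cell response rates are left unweighted, which is an assumption that is adequate only when the inclusion probabilities are constant within cells.

```python
import numpy as np

def within_cell_propensities(cells, responded):
    """Estimate each unit's response propensity as the response rate of its cell."""
    cells = np.asarray(cells)
    responded = np.asarray(responded, dtype=float)
    rho_hat = np.empty(len(cells))
    for c in np.unique(cells):
        mask = cells == c
        rho_hat[mask] = responded[mask].mean()
    return rho_hat

def r_indicator(rho_hat, pi, N):
    """Estimated R-indicator as in (3): one minus twice the design-weighted
    standard deviation of the estimated response propensities."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)            # design weights 1 / pi_i
    rho_bar = np.sum(d * rho_hat) / N
    s2 = np.sum(d * (rho_hat - rho_bar) ** 2) / (N - 1)
    return 1.0 - 2.0 * np.sqrt(s2)

# Toy sample of 8 units from a population of N = 16 (pi_i = 0.5), two cells
cells = [0, 0, 0, 0, 1, 1, 1, 1]
responded = [1, 0, 0, 0, 1, 1, 1, 0]
rho_hat = within_cell_propensities(cells, responded)  # 0.25 in cell 0, 0.75 in cell 1
R_hat = r_indicator(rho_hat, pi=np.full(8, 0.5), N=16)
```

Equal propensities give a value of 1, while propensities split evenly between 0 and 1 give a value close to 0.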

It is evident from (3) that $\hat{R}_\rho$, as already stated for all indicators of the second
type, provides a single measure of the risk of bias for the whole survey and does not
give any direct information about the real bias of a single survey statistic. Therefore,
in a multi-purpose survey $R_\rho$ could be a better indicator for some survey statistics and
a less effective one for others. In fact, in a survey with several survey variables it
is unlikely that a set of auxiliary variables can be identified that is correlated both
with the response probability and with every survey variable. In such a situation, $R_\rho$
could, in any case, be a useful quality measure of the survey data collection process as
a whole. Moreover, when some auxiliary variables that are relevant for describing the
survey population are available, it is reasonable to ask whether the subset of
respondents is representative, at least, with respect to these, and $R_\rho$ can provide the
answer. Finally, if the model used to estimate the response propensities is correct and
$R_\rho$ has been used for adapting the data-collection process in order to achieve a highly
representative response set, i.e. a value of $R_\rho$ close to 1, it is likely that the risk
of non-response bias is negligible even for those statistics that are not correlated
with the auxiliary variables used to estimate the response propensities. When a low
value of $R_\rho$ is obtained, more investigation is needed. In fact, as empirically shown in
Section 3, if a statistic is correlated with the auxiliary variables used to estimate
$R_\rho$, then a low value of $R_\rho$ corresponds to a high bias of the statistic. Conversely, for
a statistic that is not correlated with the auxiliary variables used to estimate $R_\rho$,
the bias may be negligible even with a low value of $R_\rho$. Hence, for moderate or
low values of $R_\rho$ we recommend creating, in addition to it, an indicator estimated at
the statistic level for each statistic. A simple indicator of this type is the variation
of the respondent means across the percentiles of the estimated response propensities
(Olson, 2006). One could simply plot these means, or build a synthetic index as the
ratio (denoted as $E_y$ below) of their deviance to the total deviance of the respondents'
values.
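
The synthetic index can be sketched as the between-group deviance of the respondent means divided by the total deviance of the respondent values, with groups formed from the quantiles of the estimated propensities. The function name, the use of five quantile groups as default and the toy data are assumptions made for this example.

```python
import numpy as np

def deviance_ratio(y_resp, rho_resp, n_groups=5):
    """Between-group deviance over total deviance of respondent values, with
    groups defined by quantiles of the estimated response propensities."""
    y = np.asarray(y_resp, dtype=float)
    rho = np.asarray(rho_resp, dtype=float)
    # Interior quantile cut points of the propensities (quintiles by default)
    edges = np.quantile(rho, np.linspace(0, 1, n_groups + 1)[1:-1])
    group = np.searchsorted(edges, rho, side="right")
    total = np.sum((y - y.mean()) ** 2)
    between = sum(
        np.sum(group == g) * (y[group == g].mean() - y.mean()) ** 2
        for g in np.unique(group)
    )
    return between / total

# Toy respondents: y fully explained by the propensity group, then nearly unrelated
rho = np.array([0.1] * 5 + [0.9] * 5)
ratio_strong = deviance_ratio([1.0] * 5 + [3.0] * 5, rho, n_groups=2)
ratio_weak = deviance_ratio([1.0, 3.0] * 5, rho, n_groups=2)
```

Values near 1 indicate that the respondent means change markedly with the response propensity, and values near 0 that they hardly change.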


3 Simulation Study 


Our aim is to empirically explore the conditions under which $R_\rho$ is able to predict the
bias of the unweighted mean estimator of a survey variable, and how the variation
of means across the percentiles of the estimated response propensities may be useful
for assessing its effectiveness. To this end we perform a simulation study reproducing
the simulation setting used by Little and Vartivarian (2005) to provide, in the setting
of weighting adjustments for a survey mean based on adjustment cells, empirical proof
that non-response weighting adjustments are effective in reducing bias if the auxiliary
information used for their estimation is related to both the non-response mechanism
and the outcome of interest.

Simulation setting: 


• $x$ is a categorical variable with 10 categories that identify 10 adjustment cells;

• conditional on the sample size, the sampled cases have a multinomial distribution
over the ($10 \times 2$) contingency table based on the classification of the response
indicator, $\delta$, and $x$, with cell probabilities


$$\mathrm{pr}(\delta = 1, x = c) = \mathrm{pr}(\delta = 1)\,\mathrm{pr}(x = c \mid \delta = 1)$$
$$\mathrm{pr}(\delta = 0, x = c) = (1 - \mathrm{pr}(\delta = 1))\,\mathrm{pr}(x = c \mid \delta = 0) \qquad c = 1, \dots, 10 \qquad (4)$$

given in Table 1 for two marginal response rates, 70% and 52%, and three conditional
distributions of $\delta$ given $x$ corresponding to high, medium and low association
between the two variables.

• The simulated distributions of $y$ given $\delta = h$ ($h = 0, 1$) and $x = c$ have the form:


$$[y \mid \delta = h, x = c] \sim N(\beta_0 + \beta_1 x, \sigma^2), \qquad (5)$$


and three sets of values of $(\beta_1, \sigma^2)$ corresponding to high, medium and low
association between $y$ and $x$ are considered and shown in Table 2. The intercept $\beta_0$
is chosen so that the overall mean of $y$ is 26.3625 for each scenario.

• 10,000 replicate samples of size 400 were simulated for each combination of
parameters in Tables 1 and 2.

• For each replicate the following estimates have been produced: (1) the unweighted
mean of the respondents; (2) the response probability, for each unit in the sample,
using the weighting within cell method with the cells corresponding to
the 10 categories of $x$; (3) $R_\rho$ and (4) $E_y$, considering the variation of the unweighted
means of the respondents across 5 percentiles of the estimated response
probabilities.
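
A single replicate of this setting can be sketched as follows for the scenario with 52% response rate, high association between $x$ and $\delta$ (Table 1) and high association between $x$ and $y$ (Table 2). The seed and variable names are arbitrary, equal inclusion probabilities are assumed so that (3) reduces to an unweighted standard deviation over the sample, and the code is an illustration of the described design rather than the author's program.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Table 1, response rate 52%, high association between x and delta (percent of sample)
resp = np.array([0.55, 1.00, 4.01, 4.52, 5.04, 5.55, 6.06, 6.58, 9.14, 9.96])
nonresp = np.array([8.69, 9.00, 6.01, 5.53, 5.04, 4.54, 4.04, 3.54, 1.02, 0.20])
probs = np.concatenate([resp, nonresp])
probs = probs / probs.sum()          # percentages sum to 100 only up to rounding

# Table 2, high association between x and y; beta0 fixes the overall mean of y
beta1, sigma2, y_mean = 4.75, 46.0, 26.3625
x_values = np.arange(1, 11)
mean_x = np.sum(x_values * (resp + nonresp)) / np.sum(resp + nonresp)
beta0 = y_mean - beta1 * mean_x

# One replicate sample of size 400 from the multinomial in (4)
n = 400
counts = rng.multinomial(n, probs)
x = np.concatenate([np.repeat(x_values, counts[:10]), np.repeat(x_values, counts[10:])])
delta = np.concatenate([np.ones(counts[:10].sum(), int), np.zeros(counts[10:].sum(), int)])
y = rng.normal(beta0 + beta1 * x, np.sqrt(sigma2))   # outcome model (5)

# (1) unweighted respondent mean and its relative bias
ybar_r = y[delta == 1].mean()
rel_bias = (ybar_r - y_mean) / y_mean

# (2) weighting within cell propensities and (3) R-indicator (equal pi_i assumed)
cell_rate = np.array([delta[x == c].mean() for c in x_values])
rho_hat = cell_rate[x - 1]
R_hat = 1 - 2 * np.sqrt(np.sum((rho_hat - rho_hat.mean()) ** 2) / (n - 1))
```

Repeating the replicate 10,000 times and summarising `rel_bias` and `R_hat` would give quantities comparable to those of the corresponding scenario in Table 3.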


Table 1 Percent of sample in cell $x \times \delta$

Response Rate = 52%

association                                       x
$x$ and $\delta$               1     2     3     4     5     6     7     8     9    10

High     $\delta = 1$   0.55  1.00  4.01  4.52  5.04  5.55  6.06  6.58  9.14  9.96
         $\delta = 0$   8.69  9.00  6.01  5.53  5.04  4.54  4.04  3.54  1.02  0.20

Medium   $\delta = 1$   2.77  3.50  4.01  4.52  5.04  5.55  6.06  6.58  7.11  7.62
         $\delta = 0$   6.47  6.50  6.01  5.53  5.04  4.54  4.04  3.54  3.05  2.54

Low      $\delta = 1$   4.62  5.15  5.21  5.28  5.34  5.40  5.45  5.52  5.58  5.64
         $\delta = 0$   4.62  4.85  4.81  4.77  4.73  4.69  4.65  4.60  4.57  4.52

Response Rate = 70%

association                                       x
$x$ and $\delta$               1     2     3     4     5     6     7     8     9    10

High     $\delta = 1$   0.55  3.00  6.51  7.04  7.55  8.07  8.59  9.11  9.64  9.96
         $\delta = 0$   8.69  7.00  3.51  3.02  2.52  2.02  1.52  1.01  0.51  0.20

Medium   $\delta = 1$   4.44  5.30  5.81  6.33  6.85  7.37  7.88  8.40  8.93  9.45
         $\delta = 0$   4.80  4.70  4.21  3.72  3.22  2.72  2.22  1.72  1.22  0.71

Low      $\delta = 1$   6.19  6.85  6.91  6.98  7.05  7.11  7.17  7.24  7.31  7.37
         $\delta = 0$   3.05  3.15  3.11  3.07  3.02  2.98  2.93  2.88  2.84  2.79


Table 2 Parameters $\beta_1$ and $\sigma^2$ for outcome model (5)

association
between $x$ and $y$    $\beta_1$   $\sigma^2$

High                4.75     46
Medium              3.70    122
Low                 0.00    234


The empirical relative bias of the unweighted mean, the median across the
replications of $R_\rho$ and the median across the replications of $E_y$ are reported in
Table 3, from which we note that: (1) When the association between $\delta$ and $x$ is low,
the $R_\rho$ value is high and the bias of the unweighted mean, even though it decreases as
the association between $y$ and $x$ decreases, is always very low. (2) On the contrary, a
low value of $R_\rho$ does not necessarily mean a high bias of the unweighted mean, since
when the association between $y$ and $x$ is low, the bias of the unweighted mean is
negligible irrespective of the value of $R_\rho$. (3) A value of $E_y$ close to zero allows
for identifying the situations in which, given a low association between $y$ and $x$, the
bias of the unweighted mean is negligible. (4) If the model used to estimate the
response propensities is correct, the two indicators, considered jointly, allow for
discriminating between the statistics for which the risk of non-response bias is higher
($R_\rho$ is closer to 0 or $E_y$ is closer to 1) and those for which it is lower ($R_\rho$ is
closer to 1 and $E_y$ is closer to 0).


Table 3 Summaries of results based on 10,000 replicate samples for each of 18 scenarios

                             Response Rate = 52%        Response Rate = 70%
association  association     emp.                       emp.
x and δ      x and y         bias     R_ρ    R_y        bias     R_ρ    R_y

High         High           27.24%   0.43   0.63       19.10%   0.44   0.64
             Medium         21.23%   0.43   0.35       19.91%   0.44   0.35
             Low             0.02%   0.43   0.02        0.08%   0.44   0.01

Medium       High           14.76%   0.68   0.64       11.32%   0.70   0.67
             Medium         11.48%   0.68   0.38        8.86%   0.70   0.40
             Low             0.05%   0.68   0.02        0.04%   0.70   0.01

Low          High            2.16%   0.85   0.31        2.16%   0.86   0.31
             Medium          1.68%   0.85   0.19        1.68%   0.86   0.19
             Low             0.00%   0.85   0.01        0.00%   0.86   0.01
A sampling design for the evaluation of
earthquakes vulnerability of the residential
buildings in Florence


Un disegno campionario per la stesura di una mappa 
della vulnerabilità sismica dell’edificato residenziale della 
città di Firenze 


Emilia Rocco, Bruno Bertaccini, Giulia Biagi and Andrea Giommi 


Abstract The assessment of the earthquake vulnerability of buildings is a key step in the analysis of the seismic risk of a territory. The aim of this study is to identify an appropriate sampling design for analysing the earthquake vulnerability of the residential buildings in the city of Florence. To identify such a design, we considered that buildings are statistical units selected from a territory and may therefore be spatially correlated. Since in such cases it is advantageous to select units well spread over the territory, we propose a spatially balanced sampling design. In addition to the geographical location of each building, the suggested design takes into account other auxiliary information on characteristics of the buildings that may affect their vulnerability.

Abstract La valutazione della vulnerabilità sismica delle costruzioni è un passo 
fondamentale nelle analisi del rischio sismico di un territorio e nella definizione 
di scenari di danno per terremoti di diverse intensità. Lo scopo di questo studio 
è l’individuazione di un appropriato disegno di campionamento per l’analisi della 
vulnerabilità sismica degli edifici residenziali della città di Firenze. Per individuare 
un tale disegno è importante tener conto che gli edifici come tutte le unità statistiche 
selezionate da un territorio sono generalmente correlati spazialmente e questa loro 
caratteristica rende vantaggiosa la selezione di unità ben diffuse sul territorio. Per- 
tanto, suggeriamo di utilizzare un disegno di campionamento spazialmente bilanci- 
ato. Oltre alle informazioni sulla posizione geografica di ciascun edificio, il disegno 
campionario proposto sfrutta anche altre informazioni ausiliarie sulle caratteris- 
tiche degli edifici che possono influenzare la loro vulnerabilità sismica. 


Key words: auxiliary information, balanced sampling, spatial correlation, units 
well spread over the territory 
1 Introduction


The seismic hazard of a territory is defined as the frequency and strength of the earthquakes that could affect it. The consequences of an earthquake, however, also depend on how well the buildings in the territory resist the actions of a seismic shock. The predisposition of a building to be damaged is called vulnerability: the more vulnerable the buildings are to earthquakes, the greater the consequences will be. Therefore, the assessment of the vulnerability of buildings to earthquakes is a key step in the analysis of seismic risk and in the definition of hazard scenarios for earthquakes of different levels of intensity. In this framework, several researchers of the University of Florence, belonging to various disciplinary areas including Statistics, Earth Sciences, Architecture and Engineering, have defined a research project, named SISMED, whose goal is to draft a map of the seismic vulnerability of the residential buildings located in the city of Florence. This map should be based on the georeferencing of a seismic vulnerability index calculated for each residential building of the municipality. Such a census evaluation is, however, unfeasible: collecting, for each building, all the data needed to calculate its index of seismic vulnerability is complex, and reliable estimates predict that a complete analysis could take about 30,000 man-days. It is therefore necessary to select a sample of residential buildings, and the aim of this study, which represents a preliminary phase of SISMED, is to define an appropriate sampling strategy.

In the following Section we describe the characteristics of the target population and 
of the frame available for the sample selection, whereas in Section 3 we present our 
sampling design proposals. 


2 The spatial population under study 


In recent years, the Department of Earth Sciences of the University of Florence has analyzed in detail the geological and geophysical settings of the Florence underground. These studies revealed which parts of the city area are affected by different levels of amplification of the seismic energy due to site effects. The seismic amplification affects the infrastructures. Consequently, the earthquake vulnerability of a building depends on the substrate on which it lies, the ground seismic response and the dynamic response of the building. The last is complex to evaluate and requires a site inspection by specialized technical staff, but it is also highly correlated with some characteristics of the buildings that may be deduced from statistical or administrative sources, such as the year of construction, the type of construction (masonry/concrete) and the height.

The number of buildings in the municipality of Florence recorded by the ISTAT 2011 Census was 47,509, and about 65% of them have a “residential” destination. Residential buildings are mostly in the suburbs of the city, whereas buildings with a “commercial” or “service” use are predominant in the historical center. About 58% of the residential buildings were already present at the end of the 19th century and therefore passed the test of the last big earthquake in Florence (1895). However, renovations carried out over the years may have modified their predisposition to be damaged by a new earthquake. On the other hand, almost 40% of the buildings, built from 1895 to 1981, have never undergone the “testing” of an earthquake, and most of them were built without any anti-seismic regulations.

Because of economic and time constraints, the study will initially be limited to a sub-area of the city. Our target population is therefore limited to the approximately 4,300 residential buildings lying within the zone just outside the perimeter of the Poggi¹ boulevards and inside the railway and Arno river boundaries, extended towards the highway junction of Florence South (see Fig. 1).


Fig. 1 Geographical area under study

¹ The main Florence boulevards are named after the architect Giuseppe Poggi, who designed them in 1865.


3 Sampling spatial populations


Statistical units selected from a territory are generally spatially correlated, which means that nearby units are more similar than units further apart. This is likely to be the case also for the residential buildings in the city of Florence: nearby buildings not only share the substrate on which they lie but have often been built in the same period and/or have other common characteristics; thus they could also be more similar in their level of seismic vulnerability. It is a well-known finding in the literature on sampling from spatial populations that in a situation of this type it is advantageous to select units well spread over the territory (e.g. Stevens and Olsen, 2004; Grafström, 2012; Grafström et al., 2012). A well-spread sample is usually said to be spatially balanced. Different types of spatially balanced sampling designs have been suggested in the literature for sampling populations with spatial trends in the variables of interest, for example different types of systematic designs. For many of these designs, however, it is problematic to select the units with unequal probabilities, yet there are situations in which the use of equal selection probabilities does not appear reasonable. For this reason, we consider a spatial design recently introduced by Grafström et al. (2012), Spatially Balanced Sampling through the Pivotal Method (SBStPM), which makes it possible to select a spatially balanced sample with equal or unequal inclusion probabilities and can be used for any number of dimensions. For a detailed description we refer to Grafström et al. (2012). Different inclusion probabilities may (i) result from the type of sampling procedure (for example stratification) or (ii) be imposed by the researcher to obtain better estimates by including more important units with higher probability. Both cases require the availability of auxiliary information. From the 2011 census dataset, for each residential building in the city of Florence we know the following auxiliary variables:


• the year of construction (grouped into the following classes: until 1918, [1919-1945], [1946-1960], [1961-1970], [1971-1980], [1981-1990], [1991-2000], [2001-2005]);

• the number of storeys above ground;

• the type of construction (masonry / concrete).


All these variables may affect the vulnerability of the buildings, so it is advisable to select a sample that is representative with respect to them. Given their categorical nature, these variables cannot be used directly to assign to each unit an inclusion probability proportional to them. However, we can stratify the buildings according to these three variables and then define inclusion probabilities that are equal within each stratum and different between strata, following a non-proportional allocation rule. Since the strata are used only to define the inclusion probabilities, their number can be relatively high with respect to the total sample size which, due to the modest resources available, cannot exceed 150 units. To define the strata we grouped the categories of the variable year of construction into 4 classes and those of the variable number of storeys above ground into 3 classes, thus obtaining 4 × 3 × 2 = 24 strata. The resulting strata differ greatly in size: most of the buildings are concentrated in a few strata and there are many strata with few units, including three strata with fewer than ten buildings. For this reason we adopted the optimal allocation criterion proposed by Kish (1988), which makes it possible to assign higher inclusion probabilities to strata with a small number of buildings.
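
As a rough illustration of how such an allocation translates into stratum-level inclusion probabilities, the sketch below assumes that the criterion meant is the compromise allocation usually attributed to Kish (1988), n_h proportional to sqrt(K^-2 + I·W_h²), where K is the number of strata, W_h the stratum share of the population and I an importance parameter. The text does not state the formula, the value of I or the stratum sizes, so the formula choice, the function names and the stratum sizes used in the usage note are all hypothetical.

```python
import numpy as np


def kish_allocation(stratum_sizes, n, importance=1.0):
    """Compromise allocation n_h proportional to sqrt(K^-2 + I * W_h^2) (assumed form)."""
    N_h = np.asarray(stratum_sizes, dtype=float)
    W_h = N_h / N_h.sum()                 # stratum shares of the population
    K = len(N_h)                          # number of strata
    score = np.sqrt(K ** -2.0 + importance * W_h ** 2)
    return n * score / score.sum()        # real-valued allocation summing to n


def inclusion_probabilities(stratum_sizes, n, importance=1.0):
    """Equal inclusion probability within each stratum, capped at 1."""
    N_h = np.asarray(stratum_sizes, dtype=float)
    n_h = kish_allocation(N_h, n, importance)
    # Capping at 1 lowers the expected sample size slightly; a real design
    # would redistribute the excess among the remaining strata.
    return np.minimum(n_h / N_h, 1.0)
```

For example, `inclusion_probabilities([2000, 1200, 600, 300, 150, 40, 8], n=150)` (hypothetical stratum sizes) gives probabilities that increase as the strata get smaller, which is the behaviour the allocation is chosen for.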

Our proposal is to select the sample by means of a “variant” of the SBStPM design that ensures the selection of at least one unit in every stratum where the sum of the inclusion probabilities is at least one. This is achieved by updating, at each step, the inclusion probabilities through the local pivotal method of Grafström et al. (2012) for two nearby units belonging to the same stratum. When only one unit with a probability mass greater than 0 and less than 1 remains in a stratum, the strata are collapsed in a hierarchical fashion until a nearby unit is found. Other conditions for collapsing the strata can be evaluated in order to select samples that meet different objectives/needs.
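
The updating step mentioned above can be sketched as follows. This is a minimal version of the basic local pivotal method (a randomly chosen undecided unit competes with its nearest undecided neighbour), without the stratum restriction and the hierarchical collapsing that characterise the proposed variant; the function name and the tolerance `eps` are assumptions made for illustration.

```python
import numpy as np


def local_pivotal_sample(coords, probs, rng, eps=1e-9):
    """Select a spatially balanced sample with the given inclusion probabilities."""
    coords = np.asarray(coords, dtype=float)
    p = np.asarray(probs, dtype=float).copy()
    undecided = [i for i in range(len(p)) if eps < p[i] < 1 - eps]
    while len(undecided) > 1:
        i = undecided[rng.integers(len(undecided))]
        others = [k for k in undecided if k != i]
        dist2 = np.sum((coords[others] - coords[i]) ** 2, axis=1)
        j = others[int(np.argmin(dist2))]          # nearest undecided neighbour of i
        s = p[i] + p[j]
        if s < 1:
            # One unit receives the whole mass, the other is excluded.
            if rng.random() < p[j] / s:
                p[i], p[j] = 0.0, s
            else:
                p[i], p[j] = s, 0.0
        else:
            # One unit is selected, the other keeps the remaining mass.
            if rng.random() < (1 - p[j]) / (2 - s):
                p[i], p[j] = 1.0, s - 1
            else:
                p[i], p[j] = s - 1, 1.0
        undecided = [k for k in undecided if eps < p[k] < 1 - eps]
    if undecided:                                  # possible only if the probabilities do not sum to an integer
        k = undecided[0]
        p[k] = 1.0 if rng.random() < p[k] else 0.0
    return np.flatnonzero(p > 0.5)
```

Each update leaves the expected value of both probabilities unchanged, so the prescribed inclusion probabilities are respected, while nearby units rarely end up in the sample together.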
A local regression technique for spatially
dependent functional data: an heteroskedastic
GWR model


Un modello di regressione geografico eteroschedastico 
per dati funzionali spazialmente dipendenti 


Elvira Romano and Jorge Mateu 


Abstract In this paper we propose a localized regression technique to account for spatial non-stationarity in functional data relationships by generalising a geographically weighted regression model. We present a heteroskedastic version of the geographically weighted regression model for functional data which allows the residual variance to vary across space. In particular, we propose to calibrate the variance of the model by replacing it with a continuous mean smoother over space. In addition, in order to deal with the calibration problem and to define and measure the so-called closeness in the spatial functional dimension, this paper proposes an alternative back-fitting approach. Several simulation studies and an application to real data show the performance of the proposed method.

Abstract In questo lavoro viene proposto un modello di Regressione Geografica Pesata (Geographically Weighted Regression (GWR)) per dati funzionali spazialmente dipendenti. Il modello proposto rappresenta un’estensione del modello di regressione pesata al caso in cui si assuma la presenza di dipendenza spaziale nella variabilità degli errori. L’idea di base è quella di definire una stima smoothing della varianza spazio-funzionale degli errori collocandosi in un contesto puramente funzionale. Inoltre, dal momento che la procedura di stima locale del modello prevede la scelta di una metrica, viene introdotto e generalizzato un algoritmo iterativo di back-fitting per ottimizzare tale scelta. Le caratteristiche e le performance della metodologia proposta sono state illustrate mediante l’applicazione della stessa a numerosi data set simulati ed a dati reali.


Key words: spatially dependent functional data, Geographically Weighted Regres- 
sion, heteroskedastic 
1 Introduction


In this paper we focus on the problem of non-stationarity in parameter estimation and propose a generalisation of the geographically weighted regression (GWR) model for spatially correlated sample curves [4] obtained as realisations of a spatio-temporal stochastic process.

GWR [1] can be defined as an approach that builds spatial sub-samples of the data by means of a kernel function.

The model looks for local variation in space by moving a weighted window over the data and estimating one set of coefficient values at every chosen fit point. It thus yields local representations, modelling a process that shows directional variation in the spatial distance decay.

In the functional framework [7], the first and so far only attempt to generalise the approach of [1] was made by [6].

They considered the functional regression model [2] and defined geographical weights in a way similar to GWR. In particular, by adapting the basic regression model of [1], they incorporated the estimation of the weight matrix into the procedure for estimating the functional coefficients. A kernel function was used to define the geographical weights in terms of spatial correlation, and a Monte Carlo simulation was used to establish the spatial parameter controlling the spatial variability.

As in classical GWR, it is assumed that the variance of the error term is fixed and that the spatial weighting function, defined using the classical Euclidean distance, is applied equally at each calibration point. However, in many real cases the assumption of constant error variance and the use of the Euclidean distance alone for determining the weights may not be realistic or reasonable.

To address these potential problems, we relax the assumption of stationary residual variance by generalising a heteroskedastic version of GWR (H-GWR) [3] to the functional framework, which allows the residual variance to vary across space. We evaluate the choice of an appropriate distance metric by generalising to the functional framework a back-fitting approach that calibrates a GWR model with parameter-specific distance metrics.


2 An heteroskedastic GWR model for spatially dependent 
functional data 


Let us assume that we have a functional response variable $Y_s = \{Y_s(t), t \in T\}$ observed at a location $s \in D \subseteq \mathbb{R}^d$, whose realisation as a function of $t \in T$ is a functional datum, where $T$ is a compact subset of $\mathbb{R}$ [2].

Let $\{X_s(t), t \in T, s \in D \subseteq \mathbb{R}^d\}$ be a multivariate functional random field.

Given $s_i \in D$, $i = 1, \dots, n$, and $K$ functional covariates (with $k = 1, \dots, K$), we have the realization

$$X_{s_i}(t) = [X_{s_i,1}(t), \dots, X_{s_i,k}(t), \dots, X_{s_i,K}(t)], \qquad s_i \in D \subseteq \mathbb{R}^d.$$

The aim of GWR is to predict the functional response variable starting from the set of functional covariates by allowing local variations in rates of change [1]. The model is defined by

$$Y_{s_i}(t) = \beta_{0 s_i}(t) + \sum_{k=1}^{K} \int_T X_{s_i,k}(v)\, \beta_{s_i k}(v, t, s_i)\, dv + \varepsilon_{s_i}(t), \qquad i = 1, \dots, n, \tag{1}$$

where the function $\beta_{0 s_i}(t)$ is the mean function at location $s_i$, $\beta_{s_i k}(v, t, s_i)$ is the regression function for the $k$-th covariate at location $s_i$, and $\varepsilon_{s_i}(t)$ is a random error function at point $s_i$.

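The calibration process is defined as a trade-off between bias and standard error and is obtained by minimising the sum of integrated squared residuals

$$\mathrm{LMISE} = \sum_{i=1}^{n} \int_T \Big[ Y_{s_i}(t) - \beta_{0 s_i}(t) - \sum_{k=1}^{K} \int_T X_{s_i,k}(v)\, \beta_{s_i k}(v, t, s_i)\, dv \Big]^2 dt. \tag{2}$$

Suppose that the functional data can be approximated by a set of basis functions, $\phi_k(v) = (\phi_{k1}(v), \dots, \phi_{k H_\phi}(v))^T$ and $\varphi(t) = (\varphi_1(t), \dots, \varphi_{H_\varphi}(t))^T$, and assume that they are centered. They can be expanded as $X_{s,k}(v) = C_k^T \phi_k(v)$, $Y_s(t) = D^T \varphi(t)$ and $\beta_{s_i k}(v, t, s_i) = \phi_k(v)^T B_{s_i k}\, \varphi(t)$, where $C_k$, $D$ and $B_{s_i k}$ are matrices of dimension $n \times H_\phi$, $n \times H_\varphi$ and $H_\phi \times H_\varphi$, and $H_\phi$ and $H_\varphi$ are the numbers of basis functions.

Then (2) becomes

$$\mathrm{LMISE} = \mathrm{trace}\Big\{ \Big(D - \sum_{k=1}^{K} C_k J_{\phi_k} B_{s_i k}\Big) J_\varphi \Big(D - \sum_{k=1}^{K} C_k J_{\phi_k} B_{s_i k}\Big)^T \Big\}, \tag{3}$$

where $J_{\phi_k} = \int \phi_k(v) \phi_k(v)^T dv$ and $J_\varphi = \int \varphi(t) \varphi(t)^T dt$. It is minimised by choosing the $B_{s_i k}$ that solve

$$(C_k J_{\phi_k})^T W_{s_i} \Big( \sum_{k=1}^{K} C_k J_{\phi_k} B_{s_i k} \Big) J_\varphi = (C_k J_{\phi_k})^T W_{s_i} D J_\varphi, \tag{4}$$
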

where $W_{s_i}$ ($n \times n$) is a diagonal weight matrix with generic element

$$w_{s_i} = w_{s_i s_g} = \exp\left(-\frac{d_{s_i, s_g}^2}{h^2}\right), \tag{5}$$

where $d_{s_i, s_g}$ is the Euclidean distance between location $s_i$ and location $s_g$, and $h$ is a non-negative parameter known as the bandwidth, selected by a cross-validation criterion. The method suffers from the problem that the prediction accuracy of the model for functional data does not improve when the weights are added to the functional regression model, although the goodness of fit does. In addition, using the Euclidean distance alone to determine the weights may not be realistic, because the attribute effects of the focal point and its neighbours are totally ignored.
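
To make the locally weighted fit concrete, the sketch below computes kernel weights around a focal location and solves the weighted normal equations in coefficient space for a single functional covariate (K = 1), assuming that $J_\varphi$ is nonsingular so that it cancels from both sides. The Gaussian-type kernel, the restriction to K = 1 and the function names are assumptions made for illustration; the basis coefficient matrices are taken as given.

```python
import numpy as np


def gwr_weights(coords, focal, bandwidth):
    """Kernel weights of all locations with respect to the focal location (assumed Gaussian-type kernel)."""
    coords = np.asarray(coords, dtype=float)
    dist2 = np.sum((coords - coords[focal]) ** 2, axis=1)   # squared Euclidean distances
    return np.exp(-dist2 / bandwidth ** 2)


def local_coefficients(C, D, J_phi, weights):
    """Weighted least-squares solution B of (C J)^T W (C J) B = (C J)^T W D for one covariate."""
    Z = C @ J_phi                       # n x H_phi design matrix in coefficient space
    ZtW = Z.T * weights                 # same as Z.T @ diag(weights)
    return np.linalg.solve(ZtW @ Z, ZtW @ D)
```

Repeating the fit with each location in turn as focal point gives one coefficient matrix per location, which is what allows the regression surface to vary over space.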

To address these problems, and with the aim of providing a spatial prediction technique that deals with the spatial non-stationarity of the functional coefficients, we propose a local model that can be seen as a local model of the variance, since the local non-stationarity is a consequence of spatial variance heterogeneity.

Borrowing the idea from [3], we introduce a model calibration based on the local estimation of the squared residuals.

In particular, we suppose that the variance of the model residuals depends on the spatial location.

The GWR prediction variance at a generic location $s_i$, without any assumption of spatial dependence, is defined as

$$\sigma^2_{GWR_{s_i}}(t) = \mathrm{var}\{\hat{Y}_{s_i}(t) - Y_{s_i}(t)\} = \hat\sigma^2(t)\,[1 + S_{s_i}(t)], \tag{6}$$

where:

• $\hat\sigma^2(t) = RSS(t)/(n - ENP)$, where $RSS(t)$ is the residual sum of squares and $ENP$ is the effective number of parameters of the GWR fit. This function does not depend on the spatial location.

• $S_{s_i}(t)$ are the elements of the matrix $S = (C_k J_{\phi_k})^T W_{s_i} \big(\sum_{k=1}^{K} C_k J_{\phi_k} B_{s_i k}\big) J_\varphi$.

We propose to calibrate the variance of the model, $\sigma^2_{GWR_{s_i}}(t)$, by replacing $\hat\sigma^2(t)$ with $\hat\sigma^2_{s_i}(t)$.

If we assume that $\sigma^2_{s_i}(t)$ is a continuous function over space, we can estimate it by a mean smoother. The final variance $\hat\sigma^2_{s_i}(t)$ replaces $\hat\sigma^2(t)$ to give


$$\sigma^2_{HGWR_{s_i}}(t) = \mathrm{var}\{\hat{Y}_{s_i}(t) - Y_{s_i}(t)\} = \hat\sigma^2_{s_i}(t)\,[1 + S_{s_i}(t)]. \tag{7}$$

For the local variance estimation, we need to model the relationship with the local means. We therefore define the local mean by a local smoother,

$$m_s(t) = \frac{\sum_{i=1}^{n} w_{s_i} Y_{s_i}(t)}{\sum_{i=1}^{n} w_{s_i}} = \frac{\sum_{i=1}^{n} \sum_{l=0}^{L} w_{s_i}\, a_l(t) f_l(s_i)}{\sum_{i=1}^{n} w_{s_i}}, \qquad s \in D,\ t \in T, \tag{8}$$

where $f_l(\cdot)$, $l = 1, \dots, L$, are known functions of the variable $s$ and $a_l(\cdot)$, $l = 1, \dots, L$, are functional coefficients that do not depend on the spatial location.

Thus the dependence of the mean on space is related to the functions $\{f_l(\cdot)\}_{l=1,\dots,L}$ and to the weights $w_{s_i}$.

These weights are obtained with the same kernel function specified for GWR, so that the rate of spatial variation follows the same criteria.

Consequently the local variance smoother becomes

$$\hat\sigma^2_s(t) = \frac{\sum_{i=1}^{n} w_{s_i} \big(Y_{s_i}(t) - m_{s_i}(t)\big)^2}{\sum_{i=1}^{n} w_{s_i}}. \tag{9}$$

It is a mean smoother over the observed squared residuals, and it provides the following local variance estimate:

$$\hat\sigma^2_s(t) = \frac{\sum_{i=1}^{n} w_{s_i} \Big( Y_{s_i}(t) - \sum_{j=1}^{n} \sum_{l=0}^{L} w_{s_j}\, a_l(t) f_l(s_j) \Big/ \sum_{j=1}^{n} w_{s_j} \Big)^2}{\sum_{i=1}^{n} w_{s_i}}.$$
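
A minimal numerical sketch of the two smoothers (8) and (9) is given below for curves observed on a common grid of time points. The Gaussian-type kernel, the discretisation of the curves as rows of an array and the function names are assumptions made for illustration; the local mean is computed directly as a kernel-weighted average of the observed curves.

```python
import numpy as np


def kernel_matrix(coords, bandwidth):
    """Pairwise kernel weights between all locations (assumed Gaussian-type kernel)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / bandwidth ** 2)


def local_mean_and_variance(Y, coords, bandwidth):
    """Kernel-weighted local mean and local variance of the curves.

    Y has one row per location and one column per time point.
    """
    W = kernel_matrix(coords, bandwidth)
    W = W / W.sum(axis=1, keepdims=True)   # normalise so each row of weights sums to one
    m = W @ Y                              # local mean at every location and time point
    sigma2 = W @ (Y - m) ** 2              # weighted mean of the squared residuals
    return m, sigma2
```

The resulting array `sigma2` plays the role of the location-specific variance that replaces the global variance in the prediction variance of the heteroskedastic model.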


The logic is the same as that of using weighted least squares (WLS) in multiple linear regression (MLR) to stabilise a non-constant residual variance. The algorithm is re-applied with updated estimates of $\beta_{s_i k}(v, t, s_i)$ until an acceptable level of convergence is reached. As the parameter estimates are updated, the H-GWR prediction at $s_i$ is updated as well. Assuming that each pair of independent and dependent functional variables in the H-GWR model may correspond to a different optimal distance metric, we propose to calibrate H-GWR with parameter-specific distance metrics by generalising a back-fitting procedure [5].

The procedure consists of evaluating different distance metrics for estimating the corresponding parameters and choosing, through an iterative procedure, the one with the best fitting value. In practice it is performed in three main steps: initialise the response variable for several distance metrics; compute the distance between the response variable and the model calculated using a specific distance; compute the residuals and choose the best model, repeating until the residual sum of squares converges. This makes it possible to understand the relationships among the variables over space as well as over time, with more locally accurate measures of prediction uncertainty. Based on minimal assumptions, and as shown in many simulations and in a real data study, the proposed H-GWR shows a significant improvement over GWR in terms of AIC measures and parameter estimation [8].
Models for jumps in trading volume
Modelli per i salti nel trading volume


Eduardo Rossi and Paolo Santucci de Magistris 


Abstract In finance theory the log-price is often supposed to follow an Ito semimartingale, while no explicit assumptions are made on the dynamic evolution of trading volumes. Trading volume is a measure of the quantity of shares that change owners for a given security. The daily volume on a security can fluctuate on any given day depending on the amount of new information available about the company. We assume that the dynamic evolution of trading volume is represented as a semimartingale. Analogously to stock prices, the stochastic process for trading volume might be characterized by jump components. We distinguish between two classes of widely used processes: Brownian semimartingales plus jumps and pure-jump models. The relative contribution of each of the two components is estimated by means of alternative nonparametric methods. We also analyze whether the jump component is a stochastic process of finite or infinite variation. Finally, alternative parametric models are estimated and compared.

Abstract Nella teoria della finanza si assume che il processo stocastico del log-prezzo segua una semimartingala di Ito mentre non sono esplicitate le ipotesi sulla dinamica dei volumi scambiati (trading volume). Il trading volume di un titolo azionario è il numero di azioni scambiate. L’ammontare di volume giornaliero relativo al singolo titolo può fluttuare ogni giorno in funzione delle nuove informazioni disponibili. Si assume che l’evoluzione dinamica del trading volume possa essere rappresentata da una semimartingala. Analogamente a quanto si suppone per i log-prezzi, il processo stocastico per il trading volume è caratterizzato dalla presenza di una componente di salto. Nel lavoro si distingue tra due classi di processi: semimartingale browniane con salti e modelli di salto. Il contributo relativo di ognuna delle componenti è stimato con tecniche non parametriche. Si indaga anche se la componente di salto è un processo stocastico di variazione finita o infinita. Infine, sono stimati e comparati modelli parametrici alternativi.


Key words: Trading Volume, Jumps, Activity level, Infinite variation. 


1 Introduction 


For each market equilibrium we have an equilibrium price and quantity. In finance 
theory the price is often supposed to follow an Ito semimartingale while no explicit 
assumptions are made on the dynamic evolution of trading volumes. Trading vol- 
ume is a measure of the quantity of shares that change owners for a given security. 
The amount of daily volume on a security can fluctuate on any given day depending 
on the amount of new information available about the company, whether options 
contracts are set to expire soon, whether the trading day is a full or half day, and 
many other possible factors. Of the many different elements affecting trading volume,
the one most closely correlated with the fundamental valuation of the security
is the new information provided. This information can be a press release or a regular
earnings announcement provided by the company, or it can be a third party commu- 
nication, such as a court ruling or a release by a regulatory agency pertaining to the 
company. The news release can generate large variations in the trading volume. The 
trading volume can be measured instantaneously for each trade or cumulated for a 
given time interval. This implies that for longer time intervals the trading volume is 
an increasing process. This is not the case for the price process. 

As in the case of prices, we assume that the dynamic evolution of trading volume is represented as an Ito semimartingale (SM) defined on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \in [0,T]}, \mathbb{P})$ satisfying the usual conditions, evolving as

$$X_t = X_0 + \int_0^t b_s\, ds + \int_0^t \sigma_s\, dW_s + \sum_{s \le t} \Delta X_s,$$

where

$$\Delta X_s = X_s - X_{s-}$$


is the size of the jump at time $s$. Even when the whole path of $X$ is observed over $[0, T]$, one can infer neither the drift nor the Lévy measure. With a finite $T$ we can only infer the behavior of the Lévy measure near 0. For a semimartingale the activity index takes values in the interval $[0, 2]$ and makes it possible to distinguish different classes of stochastic processes. For a Lévy process the jump activity index coincides with the Blumenthal-Getoor index of the process [1, 2]. The Blumenthal-Getoor index is zero for finite-activity jump processes (which have a finite number of jumps in any finite interval) and it is equal to two for continuous (local) martingales. Stochastic processes with Blumenthal-Getoor indices in $(0, 2)$ are infinitely active pure-jump processes, with paths of infinite variation if and only if the index is larger than one. When the process is the sum of a jump component and a continuous process driven by Brownian motion, its activity index takes the value 2 regardless of the activity of the jumps. In general, the jump activity of a superposition of different Ito semimartingales is equal to the Blumenthal-Getoor index of the most active component. If $X$ is a stable process, the index $\beta$ is also the stable index of the process. $\beta$ captures the level of activity: as $\beta$ increases, the (small) jumps tend to become more and more frequent.

The main research question of this paper is: which process best approximates the trading volume dynamics? In other words, we want to distinguish between two classes of processes widely used in modeling the dynamics of financial prices: Brownian semimartingales plus jumps (with Blumenthal-Getoor index equal to 2) and pure-jump models (with Blumenthal-Getoor index less than 2). The study of the trading volume (TV) dynamics gives a better understanding of the role played by small and large jumps in equilibrium and in the microstructure of financial markets.


2 Which jumps in trading volume? 


We assume that the observations are collected at a discrete sampling interval $\Delta_n$, which means that there are $[T/\Delta_n]$ observed increments of $X$ on $[0, T]$, i.e.

$$\Delta_i^n X = X_{i\Delta_n} - X_{(i-1)\Delta_n}.$$

Let $\mu$ be the jump measure of $X$ and $\nu$ its predictable compensator, the Lévy measure. Both are positive measures on $\mathbb{R}_+ \times \mathbb{R}$.

$$\text{Small jumps} = \int_0^t \int_{\{|x| \le \varepsilon\}} x\,(\mu - \nu)(ds, dx)$$

$$\text{Big jumps} = \int_0^t \int_{\{|x| > \varepsilon\}} x\,\mu(ds, dx)$$

where the cutoff level $\varepsilon > 0$ is arbitrary, but fixed. A SM will always generate a finite number of big jumps on $[0, T]$ but it may give rise to either a finite or an infinite number of small jumps, i.e.

$$\nu\big([0, t] \times ((-\infty, -\varepsilon) \cup (\varepsilon, +\infty))\big) < \infty$$

whereas

$$\nu\big([0, t] \times [-\varepsilon, \varepsilon]\big)$$

may be finite or infinite.
Using the methodology of power variation, define

$$V(p) = \int_0^T |\sigma_s|^p\, ds, \qquad J(p) = \sum_{s \le T} |\Delta X_s|^p, \qquad p > 0.$$

1. $V(p)$ is finite $\forall p > 0$, and $V(p) > 0$ on $\Omega_T^W$.
2. $J(p)$ is finite if $p \ge 2$ but often not when $p < 2$.


The realized power variations proposed by Aït-Sahalia & Jacod [2] are

$$B(p, u_n, \Delta_n) = \sum_{i=1}^{[T/\Delta_n]} |\Delta_i^n X|^p\, 1_{\{|\Delta_i^n X| \le u_n\}},$$

where $u_n$ is a sequence of truncation levels. With $T$ fixed, the asymptotics are all with respect to $\Delta_n \to 0$. Since $u_n$ has to converge to 0, we take $u_n = \alpha \Delta_n^{\varpi}$, with $\varpi \in (0, 1/2)$ and $\alpha > 0$. With $\varpi < 1/2$ we keep all the increments that mainly contain a Brownian contribution. The in-fill asymptotics are:

$$p > 2,\ \forall X: \quad B(p, \infty, \Delta_n) \xrightarrow{P} J(p)$$

$$\forall p,\ \text{on } \Omega_T^c: \quad \frac{\Delta_n^{1 - p/2}}{m_p}\, B(p, \infty, \Delta_n) \xrightarrow{P} V(p)$$

where $m_p$ is the $p$th absolute moment of $z \sim N(0, 1)$. When $p > 2$,

$$B(p, \infty, \Delta_n) \to J(p)$$

and the jump component dominates. If there are jumps, the limit $J(p) > 0$ is finite. If there are no jumps, $X$ is continuous and

$$J(p) = 0, \qquad B(p, \infty, \Delta_n) \to 0$$

at rate $\Delta_n^{p/2 - 1}$. We can exploit the different asymptotic behavior of $B(p, u_n, \Delta_n)$ by varying the tuning parameters:


1. the power $p$: to isolate either the continuous or the jump component, or to keep both.

• $p < 2$ emphasizes the continuous component
• $p > 2$ accentuates the jump component
• $p = 2$ gives equal treatment

2. the truncation level $u_n$. The assumption is that there exists a finite number of large jumps with fixed size. As $\Delta_n \to 0$, $u_n$ becomes smaller than the large jumps, which are thus no longer part of $B(p, u_n, \Delta_n)$. Alternatively, we can truncate to eliminate the Brownian component using the upward power variation

$$U(p, u_n, \Delta_n) = \sum_{i=1}^{[T/\Delta_n]} |\Delta_i^n X|^p\, 1_{\{|\Delta_i^n X| > u_n\}}.$$

3. the sampling frequency $\Delta_n$. Sampling at different frequencies, we can distinguish three cases based on the asymptotic behavior of the ratio

$$\frac{B(p, u_n, k\Delta_n)}{B(p, u_n, \Delta_n)}, \qquad k \ge 2.$$

As $\Delta_n \to 0$, the limiting behavior can be

1. limit of the ratio $= 1$: $B(p, u_n, k\Delta_n)$ converges to a finite limit
2. limit of the ratio $< 1$: $B(p, u_n, k\Delta_n)$ diverges to infinity
3. limit of the ratio $> 1$: $B(p, u_n, k\Delta_n)$ converges to 0

The model includes three components: a continuous part, a small jumps part and a big jumps part. Accordingly, we can describe the possible behavior by means of sets defined pathwise on $[0, T]$:

$$\Omega_T^c = \{X \text{ is continuous in } [0, T]\}$$
$$\Omega_T^j = \{X \text{ has jumps in } [0, T]\}$$
$$\Omega_T^f = \{X \text{ has finitely many jumps in } [0, T]\}$$
$$\Omega_T^i = \{X \text{ has infinitely many jumps in } [0, T]\}$$
$$\Omega_T^W = \{X \text{ has a Wiener component in } [0, T]\}$$
$$\Omega_T^{noW} = \{X \text{ has no Wiener component in } [0, T]\}$$


We should also note that we observe a time series originating from a given unobserved
path in $\Omega$ and wish to determine in which of these sets the path is likely to be. Any
such time series can be obtained by discretization of a continuous path and also of a
discontinuous one.
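
The truncated and upward power variations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the simulated path (Brownian motion with constant volatility plus two fixed-size jumps), the helper names and the tuning values `alpha = 1.0` and `varpi = 0.4` are hypothetical and are not taken from the paper.

```python
import math
import random


def increments(path):
    """Observed increments X_{i*dt} - X_{(i-1)*dt} of a discretely sampled path."""
    return [b - a for a, b in zip(path, path[1:])]


def truncated_power_variation(incs, p, u):
    """B(p, u, dt): sum of |increment|^p over increments with |increment| <= u."""
    return sum(abs(d) ** p for d in incs if abs(d) <= u)


def upward_power_variation(incs, p, u):
    """U(p, u, dt): sum of |increment|^p over increments with |increment| > u."""
    return sum(abs(d) ** p for d in incs if abs(d) > u)


def simulate_path(n, T=1.0, sigma=0.2, jump_times=(), jump_size=0.5, seed=0):
    """Toy model: Brownian motion with volatility sigma plus a few fixed-size jumps."""
    rng = random.Random(seed)
    dt = T / n
    jump_idx = {int(t / dt) for t in jump_times}
    x, path = 0.0, [0.0]
    for i in range(n):
        x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        if i in jump_idx:
            x += jump_size
        path.append(x)
    return path


n, T = 10_000, 1.0
dt = T / n
alpha, varpi = 1.0, 0.4            # hypothetical tuning values for u_n = alpha * dt**varpi
u_n = alpha * dt ** varpi
incs = increments(simulate_path(n, T, jump_times=(0.25, 0.7)))

# p = 2 without truncation: integrated variance plus the squared jumps
B_all = truncated_power_variation(incs, 2, float("inf"))
# p = 2 with truncation: the two jumps are discarded, leaving roughly sigma**2 * T
B_trunc = truncated_power_variation(incs, 2, u_n)
# p = 0 upward variation: simply counts the increments larger than u_n
n_large = upward_power_variation(incs, 0, u_n)
```

In this toy setting the truncated statistic isolates the continuous part, while the upward statistic with $p = 0$ recovers the number of jumps.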

The jump activity index at time $t$ is the random number (see [1])

$$\beta_t = \inf\Big\{ r \ge 0 : \sum_{s \le t} |\Delta X_s|^r < \infty \Big\}.$$

Following [3], let $u_n = \alpha \Delta_n^{\varpi}$ and $u_n' = \alpha' \Delta_n^{\varpi}$; the estimator is

$$\hat\beta_n = \frac{\log\big( U(0, u_n, \Delta_n) / U(0, \gamma u_n, \Delta_n) \big)}{\log(\gamma)}, \qquad \gamma = \alpha'/\alpha.$$

By using the statistic $U$, which simply counts the number of large increments, defined
as those greater than $\alpha \Delta_n^{\varpi}$, we are retaining only those increments of $X$ that
are not predominantly made of contributions from its continuous semimartingale part,
which are $O_p(\Delta_n^{1/2})$, and instead are predominantly made of contributions due to a
jump. When $X$ has only finitely many jumps, the index is $\beta = 0$ and $U(0, u_n, \Delta_n)$
converges to the number of jumps between 0 and $t$, irrespective of the value of $\alpha$,
so $\hat\beta_n = 0$ for all $n$ large enough.
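
As a rough numerical check of the counting estimator just described, the sketch below computes the ratio of log-counts for two made-up lists of increments. The function names and both data sets are hypothetical: the first list contains three large jumps (finite activity), the second uses sizes $1/k$, for which the number of increments above a level $u$ grows like $1/u$ (so the index should be close to 1).

```python
import math


def count_large(incs, u):
    """U(0, u, dt): number of increments whose absolute value exceeds u."""
    return sum(1 for d in incs if abs(d) > u)


def jump_activity_index(incs, u, gamma):
    """log(U(0, u) / U(0, gamma * u)) / log(gamma), with gamma > 1."""
    n_low, n_high = count_large(incs, u), count_large(incs, gamma * u)
    if n_low == 0 or n_high == 0:
        raise ValueError("no increments above the truncation level")
    return math.log(n_low / n_high) / math.log(gamma)


# Finitely many jumps: both truncation levels count the same three increments.
finite_jumps = [0.001, -0.002, 0.8, 0.0005, -0.6, 0.001, 0.9]
beta_finite = jump_activity_index(finite_jumps, 0.05, 2.0)

# Toy "infinite activity" pattern: sizes 1/k, so counts above u scale like 1/u.
power_law = [1.0 / k for k in range(1, 1001)]
beta_hat = jump_activity_index(power_law, 0.01, 2.0)
```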

The paper presents and discusses the results of applying the techniques shown above to
high-frequency data of SPY and individual stocks traded on the NYSE.

On a failure process driven by a self-correcting
model in seismic hazard assessment


Un processo di rotture guidato da un modello 
self-correcting nella valutazione della pericolosita sismica 


Rotondi Renata and Varini Elisa 


Abstract Two widely noted features of the earthquake generation process are the
following: a) earthquakes tend to occur in clusters, and b) fault ruptures that generate
earthquakes decrease the amount of strain present along the fault and hence the
probability that another shock occurs in the near future. These diametrically opposed
features have been widely studied separately in the literature through two classes of
models: self-exciting and self-correcting models. To reconcile these contrasting trends
we propose a new stochastic model which distinguishes strong events (leaders)
from those of lower magnitude. The former follow a stress release model; conditioned
on their occurrence, the remaining events constitute a set of ordered times of
minor ruptures occurring in the time interval between two consecutive leader events.


Abstract Due ben note caratteristiche del processo di generazione di terremoti sono 
le seguenti: a) i terremoti tendono a verificarsi in clusters, e b) le rotture di faglia 
che generano terremoti riducono lo sforzo presente sulla faglia e quindi la prob- 
abilita che si abbia un altro evento nell’immediato futuro. Queste caratteristiche 
diametralmente opposte sono state ampiamente studiate separatamente in letter- 
atura attraverso due classi di modelli: self-exciting e self-correcting. Per conciliare 
queste tendenze contrastanti proponiamo un nuovo modello stocastico che distingue 
eventi forti - leaders - da quelli di magnitudo inferiore. I primi seguono un modello 
di rilascio di sforzo; condizionato al loro accadimento, gli eventi rimanenti costi- 
tuiscono un insieme di tempi di rotture ordinati che si verificano nell’intervallo di 
tempo tra due eventi leader consecutivi. 


Key words: point processes, bathtub shaped hazard function, generalized Weibull
distributions, Bayesian inference

1 Stochastic point processes in seismology


Earthquakes are by far the most powerful natural disasters on Earth. Each year
thousands of earthquakes are recorded; among these, about 60 are classified as able
to cause fatalities and considerable damage, and about 20 are of major intensity, with
magnitude larger than 7. The potential cost of earthquakes is growing because of
increasing urban development in seismically active areas and the vulnerability of
older buildings, which may not have been built or upgraded to current building
codes. Generally only point information (origin time, epicentre, magnitude) is
available for each earthquake; this idealization enables us to study the earthquake
process through point process models. Exploratory analysis of historical catalogues
collecting the seismic activity recorded over centuries in countries like China, Greece
and Italy shows evidence of clustering in the space, time and size domains. Clustering
in time is primarily, but not only, associated with the increase of seismic activity
immediately after large earthquakes, leading to aftershock sequences. This
phenomenological aspect implies that the occurrence probability should increase
immediately after an earthquake and then decrease; this is a specific property of the class of
self-exciting models, like the widely applied Epidemic-Type Aftershock Sequence
(ETAS) model. As for physical models of earthquake generation processes, the elastic
rebound theory by Reid was the first theory to satisfactorily explain earthquakes:
the far-field plate motions cause the rocks in the region of the locked fault to
gradually accrue elastic deformation. When the accumulated strain is great enough to
overcome the strength of the rocks, an earthquake occurs. According to this theory, the
occurrence probability should drop immediately after a strong earthquake and then
increase, since the next event will happen only when enough stress has built
up along the fault. Vere-Jones and others introduced in a series of papers [11, 1] a
stochastic translation of Reid's theory, the so-called stress release model, which
belongs to the class of self-correcting models.

Each of the aforesaid models captures one feature of the phenomenon, but neither is able
to explain how it evolves in its entirety. Some attempts have been made to combine
the two contrasting trends in a single model. The simplest solution was proposed by
Schoenberg and Bolt [7], who combine the conditional intensities $\lambda_{tr}(t)$ and $\lambda_{sr}(t)$
which characterize trigger and strain-release point process models respectively; in
this way, since it is unknown which model each event follows, one assumes that
every event is generated by both models, and the ratio of the cumulative intensities
over the time interval $(0, T)$, $\Lambda_i(T)/[\Lambda_{tr}(T) + \Lambda_{sr}(T)]$, represents the percentage of
events due to the triggering effect, if $i = tr$, or to the strain-release component, if
$i = sr$. The large difference between the scales at which the triggering and
strain-release mechanisms appear to operate may be a misleading element. Another
approach consists in assuming that the different trends correspond to different phases
of the seismic activity and that the dynamics of their activation times is driven by an
unobserved pure jump Markov process; in this perspective a seismic sequence can
be considered as a realization of a series of three marked point processes: Poisson,
stress release and trigger models [9]. The comparison on simulated datasets shows
that about 70% of the events are correctly classified, but the model is hardly able to
fit the abrupt changes of state.

This leads us to think that it is more reasonable to assume that the different
behavioural trends (models) are superimposed rather than consecutive. In this
perspective we suppose that the first level (model) concerns the largest share of the released
energy and guides the remaining seismic activity, distributed within the intervals
between each pair of consecutive type I earthquakes. The timing of the secondary (or
type II) events could suggest their classification as aftershocks, isolated events
(background) and foreshocks, which matches well with a bathtub-shaped hazard function.
In Section 2 we propose a criterion for partitioning the data into two categories;
other criteria can be adopted according to the available information.


2 Superimposed point processes: failure process and 
self-correcting model 


In this study we consider two databases: the Database of Individual Seismogenic
Sources (DISS, version 3.0.2; [5]) and the Parametric Catalog of Italian Earthquakes
CPTI04 [4], which reflect the same level of knowledge at the end of 2002. DISS
is a large repository of geological, tectonic and active fault data for Italy and the
surrounding areas; in particular it contains 74 composite seismogenic sources (CSS)
located in Italy. A CSS is essentially an active structure where an entire fault system
is identified on the basis of geological data. One of the most active sources is
CSS-025, located in the central Apennines region (Fig. 1 right); among the earthquakes
of moment magnitude $M_w \ge 4.5$ recorded in CPTI04, 50 may be associated with
this fault system. To guarantee a satisfactory level of completeness of the data set,
we consider only the events that occurred since 1870 and, in particular, the set of 35
earthquakes covering the time interval from 1873 to 1985. Then we partition this
data set into the set $\{(t_i, M_w^{i})\}_{i=1}^{n}$ of $n = 9$ leader events of magnitude exceeding
the threshold $M_{th} = 5.3$, and into the sets $\{(s_{ij}, M_w^{ij})\}_{j=1}^{J_i}$, $\forall i = 1, \ldots, n-1$, of 26
subordinate events with $t_i < s_{ij} < t_{i+1}$, $t_i$ and $s_{ij}$ being the occurrence times and
$M_w$ the respective magnitudes (Fig. 1 left).

Fig. 1 (Left) Time vs magnitude plot of the data: leader events in black and
subordinate events in grey. (Right) Map of the composite seismogenic source CSS-025 with the
epicentres of the leader (black stars) and subordinate (grey circles) earthquakes.

We assume that the leader events follow the stress release (SR) model. This
model assumes that the stress $x$ increases linearly with time at a constant loading
rate $\rho$ imposed by tectonic movements; hence the stress level at time $t$ is given by
$x(t) = x_0 + \rho\, t - s(t)$, where $x_0$ is the level of stress at the beginning of the analysed
period and $s(t)$ is the accumulated stress released by the earthquakes in the area
at times $t_i$, that is $s(t) = \sum_{i: t_i < t} x_i$. By the word 'stress' we indicate any quantity
that governs the state of the system; in this case we choose as proxy measure of the
earthquake size the scaled energy $E/M_0$, the ratio of the energy $E$ to the seismic
moment $M_0$, which turns out to be the best measure to use in SR models [10]. Therefore
we have:


E M/A 
Mo Mo 
where $A$ is the rupture area of the earthquake and the seismic moment is linked to
the moment magnitude through the relation $\log_{10} M_0 = 1.5\, M_w + 9.1$. Like every point
process, the SR model is characterized by its conditional intensity function:

$$\lambda_s(t \mid \mathcal{H}_t) = \exp\Big\{ \alpha + \beta \Big[ \rho\, t - \sum_{i: t_i < t} x_i \Big] \Big\}, \qquad (1)$$

a monotonically increasing function of the stress level with parameters $\alpha$, $\beta$, $\rho$,
where $\mathcal{H}_t$ is the previous history of the process, consisting of the set
$\{(t_i, M_w^{i}),\ t_i < t,\ i = 1, 2, \ldots\}$.
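
To make the self-correcting behaviour concrete, the following sketch evaluates a stress release intensity just before and just after an event. It assumes the usual exponential form, intensity $= \exp\{\alpha + \beta(\rho t - \text{released stress})\}$; the event times, stress drops and parameter values are invented for illustration and are not estimates from the paper.

```python
import math


def released_stress(t, event_times, releases):
    """s(t): total stress released by the events that occurred strictly before t."""
    return sum(x for ti, x in zip(event_times, releases) if ti < t)


def sr_intensity(t, event_times, releases, alpha, beta, rho):
    """Stress release intensity, assumed here to be exp{alpha + beta * (rho * t - s(t))}."""
    return math.exp(alpha + beta * (rho * t - released_stress(t, event_times, releases)))


# Hypothetical leader events (times in years since the start of the catalogue)
event_times = [10.0, 25.0]
releases = [4.0, 6.0]                    # made-up stress drops x_i
alpha, beta, rho = -2.0, 0.3, 0.5        # made-up parameter values

before = sr_intensity(10.0, event_times, releases, alpha, beta, rho)
after = sr_intensity(10.001, event_times, releases, alpha, beta, rho)
# The intensity grows between events and drops by a factor exp(-beta * x_i) at each event.
```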

For the magnitude we adopt the exponential distribution, which in seismology is
inspired by the Gutenberg-Richter law, $\log_{10} N = a - bM$, which expresses the
relationship between a magnitude value $M$ and the number $N$ of events of at
least that magnitude. In our case we assign the exponential distribution
$g(M_w) = b \exp\{-b\,(M_w - m_0)\}$ on the interval $[m_0, +\infty)$ with $m_0 = 4.5$, so that the density
function of the magnitude of the leader events is:

$$g_l(M_w \mid M_w \ge M_{th}) = \frac{g(M_w)}{1 - G(M_{th})} = b\, e^{-b\,(M_w - M_{th})}, \qquad M_w \in [M_{th}, +\infty). \qquad (2)$$

Given the interval $(t_i, t_{i+1})$, $i = 1, \ldots, n-1$, let us consider the number $J_i$ of
subordinate events $(s_{ij}, M_w^{ij})$, $j = 1, \ldots, J_i$, such that $t_i < s_{ij} < t_{i+1}$ and $M_w^{ij} < M_{th}$. If we
indicate the probability of exceeding the magnitude threshold by $p = 1 - G(M_{th})$,
then $J_i$ can be interpreted as the number of failures (events of $M_w < M_{th}$) before the
next success (event of $M_w \ge M_{th}$); hence $J_i$ follows a geometric distribution with
parameter $p = \exp\{-b\,(M_{th} - m_0)\}$:

$$\Pr\{J_i = j_i\} = p\,(1 - p)^{j_i}, \qquad j_i = 0, 1, \ldots \qquad (3)$$


The occurrence times $s_{ij}$ of the subordinate events constitute a sample of $J_i$ minor
rupture times $\tau_{ij} = s_{ij} - t_i$ with $\tau_{ij} \in (0, t_{i+1} - t_i)$; since $\tau_{ij} < \tau_{i(j+1)}$, we assume that
$\tau_{ij}$, $j = 1, \ldots, J_i$, are the order statistics of a random sample drawn from the density
function $f(\cdot)$. The length of the time interval between two consecutive strong
earthquakes is governed by the stress release model, but we assume that the process of
secondary ruptures is the same once their times are normalized to the unit interval $(0, 1)$,
that is, $\tilde\tau_{ij} = \tau_{ij}/(t_{i+1} - t_i)$ are identically distributed $\forall i = 1, \ldots, n-1$ and $j = 1, \ldots, J_i$.
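
The counting and timing of the subordinate events can be sketched numerically as follows. The snippet takes the exceedance probability as one minus the exponential distribution function on $[m_0, +\infty)$ evaluated at the threshold 5.3, evaluates the geometric law for the number of subordinate events, and rescales occurrence times to the unit interval. The value of `b`, the leader times and the subordinate times are made up, and the function names are hypothetical.

```python
import math


def exceedance_probability(b, m_th, m0=4.5):
    """p = 1 - G(m_th) for the exponential law g(m) = b * exp(-b * (m - m0)) on [m0, inf)."""
    return math.exp(-b * (m_th - m0))


def geometric_pmf(j, p):
    """Probability of j subordinate events (failures) before the next leader (success)."""
    return p * (1.0 - p) ** j


def normalized_rupture_times(t_start, t_end, sub_times):
    """Map subordinate occurrence times in (t_start, t_end) to the unit interval (0, 1)."""
    return [(s - t_start) / (t_end - t_start) for s in sorted(sub_times)]


b = 2.0                                          # hypothetical value, not an estimate
p = exceedance_probability(b, m_th=5.3)
expected_subordinates = (1.0 - p) / p            # mean of the geometric distribution
total_mass = sum(geometric_pmf(j, p) for j in range(2000))
taus = normalized_rupture_times(1873.0, 1899.0, [1874.5, 1881.0, 1897.2])  # made-up times
```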

To be able to fit any trend, the probability distribution of $\tilde\tau_{ij}$ should have a hazard
function flexible enough to exhibit not only monotonic shapes, but also unimodal,
bathtub and modified bathtub (or N-shaped) shapes. The Weibull distribution is one of
the most cited lifetime distributions in reliability engineering and other disciplines;
it describes failure times observed in many phenomena, with increasing, constant, or
decreasing hazard rate. The class of generalized Weibull distributions [6] includes
many Weibull-related distributions with differently shaped hazard functions; among
them we have considered two in particular: the additive and the flexible Weibull
distributions. The additive Weibull distribution:

$$F(t) = 1 - \exp\big\{ -(t/\beta_1)^{\alpha_1} - (t/\beta_2)^{\alpha_2} \big\}, \qquad \alpha_1, \alpha_2, \beta_1, \beta_2 > 0, \qquad (4)$$

is a twofold competing risks model that can therefore jointly represent the decreasing
seismic activity after the main shock and the increase of activity before the next
strong earthquake. Since it involves two Weibull distributions, its hazard function:

$$h(t) = (\alpha_1/\beta_1)(t/\beta_1)^{\alpha_1 - 1} + (\alpha_2/\beta_2)(t/\beta_2)^{\alpha_2 - 1} \qquad (5)$$

can be increasing, if both shape parameters are greater than 1 ($\alpha_1 > 1$ and $\alpha_2 > 1$),
decreasing if $\alpha_1 < 1$ and $\alpha_2 < 1$, or bathtub-shaped if $\alpha_1 < 1$ and $\alpha_2 > 1$.
The flexible Weibull distribution has the following survival function:

$$\bar F(t) = \exp\big( -e^{\gamma t - \delta/t} \big) \qquad (6)$$

and hazard function:

$$h(t) = (\gamma + \delta/t^2) \exp(\gamma t - \delta/t). \qquad (7)$$

Unlike other generalized Weibull distributions, this distribution has a rather simple
hazard rate, and Bebbington et al. [2] showed that its shape is a modified bathtub
(i.e., $h$ is first increasing, followed by a bathtub shape) if $\gamma\delta < 27/64$.
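
The shapes claimed for the two hazard functions can be checked on a few points. The sketch below codes the additive Weibull hazard (a sum of two Weibull hazards) and the flexible Weibull hazard as given in the text; the parameter values are arbitrary choices made only to exhibit a bathtub and a modified bathtub shape.

```python
import math


def additive_weibull_hazard(t, a1, b1, a2, b2):
    """Sum of two Weibull hazards with shapes a1, a2 and scales b1, b2."""
    return (a1 / b1) * (t / b1) ** (a1 - 1) + (a2 / b2) * (t / b2) ** (a2 - 1)


def flexible_weibull_hazard(t, gamma, delta):
    """Hazard (gamma + delta / t^2) * exp(gamma * t - delta / t)."""
    return (gamma + delta / t ** 2) * math.exp(gamma * t - delta / t)


# Bathtub: one shape parameter below 1 and the other above 1 (arbitrary values).
bathtub = [additive_weibull_hazard(t, 0.5, 1.0, 3.0, 1.0) for t in (0.05, 0.3, 0.9)]

# Modified bathtub: gamma * delta = 0.1 < 27/64, so the hazard rises, falls, then rises again.
modified = [flexible_weibull_hazard(t, 1.0, 0.1) for t in (0.01, 0.05, 0.45, 1.0)]

# With gamma * delta = 1 >= 27/64 the hazard is increasing on these points.
increasing = [flexible_weibull_hazard(t, 1.0, 1.0) for t in (0.5, 1.0, 2.0)]
```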

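Finally, the probability distribution of the magnitude of the subordinate events is
defined on $[m_0, M_{th})$; hence we have:

$$g_s(M_w \mid m_0 \le M_w < M_{th}) = \frac{b\, e^{-b\,(M_w - m_0)}}{1 - e^{-b\,(M_{th} - m_0)}}. \qquad (8)$$

If we indicate by $\theta = (\alpha, \beta, \rho, b, \alpha_1, \beta_1, \alpha_2, \beta_2)$ (or $\theta = (\alpha, \beta, \rho, b, \gamma, \delta)$) the
parameter vector, the likelihood is given by:

$$
\begin{aligned}
\mathcal{L}(\theta; \text{data}) = {} & \prod_{i=1}^{n} \lambda_s(t_i) \, \exp\Big\{ -\int_{0}^{t_n} \lambda_s(u)\, du \Big\} \times \prod_{i=1}^{n} \frac{g(M_w^{i})}{1 - G(M_{th})} \ \times \\
& \prod_{i=1}^{n-1} [1 - G(M_{th})] \, [G(M_{th})]^{J_i} \ \times \\
& \prod_{i=1}^{n-1} J_i! \prod_{j=1}^{J_i} \frac{1}{t_{i+1} - t_i} \, \frac{f(\tilde\tau_{ij})}{F(1)} \times \frac{g(M_w^{ij})}{G(M_{th})} \qquad (9)
\end{aligned}
$$

where the first line gives the probability of $(t_i, M_w^{i})$, the second the probability of
the number of subordinate events, and the third the probability of $(s_{ij}, M_w^{ij})$. We
note that the factor $J_i!$ is due to the fact that there are $J_i!$ unordered samples from
$F(t)$ corresponding to the ordered sequence of observed times $s_{ij}$ in $(t_i, t_{i+1})$,
$\forall i = 1, \ldots, n-1$.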


3 Bayesian estimation 


We estimate the model parameters following the Bayesian approach. According to
this paradigm, the parameters $\theta$ are considered as random variables and our beliefs
about their variability are formalized through prior distributions. Prior knowledge
on the model parameters generally arises from the literature and previous experience.
Our model is formulated for the first time and some parameters are not strictly
related to easily measurable physical quantities; therefore we assign the prior
distributions according to an objective Bayesian perspective, by combining the empirical
Bayes method and the use of vague-proper prior distributions [3]. We choose the
prior distribution of each parameter in agreement with its support, and we express
the parameters of this prior distribution (called hyperparameters) as functions of
the prior mean $\mu_0$ and variance $\sigma_0^2$ of the corresponding model parameter so that,
once $\mu_0$ and $\sigma_0^2$ are assigned, the hyperparameters are also determined. According to the
empirical Bayes method, preliminary values of the prior means are obtained by maximizing
the marginal likelihood, and the standard deviations are set to 90% of the
corresponding means, so that the variance estimates produced by the
maximization are not too close to zero. This implies a double use of the data, in
assigning the hyperparameters and in evaluating the posterior distributions. To avoid this
drawback we choose priors that 'span the range of the likelihood function' [3], that
is, we vary the hyperparameters around their preliminary estimates and choose
those values that include most of the mass of the likelihood function, but that do not
extend too far.

In the Bayesian framework, the estimate of a parameter is typically given by its
posterior mean, obtained, together with measures of its uncertainty, from its posterior
distribution. If, through Bayes' theorem, an explicit formulation for the posterior
distribution is not available, we resort to methods of stochastic simulation based
on constructing a Markov chain that has the desired distribution as its equilibrium
distribution (Markov chain Monte Carlo (McMC) methods). In this study we have
applied the Metropolis-Hastings McMC algorithm to generate Markov chains
converging to the posterior distributions of the model parameters.
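
A minimal random-walk Metropolis-Hastings sampler is sketched below for a one-parameter toy problem; it is not the sampler used in the paper. The assumptions are all hypothetical: eight made-up magnitude excesses over $m_0$, an exponential likelihood with rate $b$, and a Gamma(2, 1) prior, chosen because the posterior is then a Gamma(10, 4.25) whose mean is known and can be compared with the chain average.

```python
import math
import random


def metropolis_hastings(log_post, start, step, n_iter, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_post(start)
    chain = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_post(proposal)
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain


excesses = [0.1, 0.4, 0.2, 0.9, 0.3, 0.6, 0.05, 0.7]   # made-up values of M_w - m0
prior_shape, prior_rate = 2.0, 1.0                      # hypothetical Gamma prior on b


def log_posterior(b):
    """Unnormalized log posterior of the exponential rate b."""
    if b <= 0.0:
        return float("-inf")
    shape = prior_shape + len(excesses)
    rate = prior_rate + sum(excesses)
    return (shape - 1.0) * math.log(b) - rate * b


chain = metropolis_hastings(log_posterior, start=1.0, step=0.8, n_iter=30_000)
kept = chain[2_000:]                                    # discard burn-in
posterior_mean = sum(kept) / len(kept)
exact_mean = (prior_shape + len(excesses)) / (prior_rate + sum(excesses))
```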


4 Results 


Fig. 2 provides a graphical summary of the results; the picture on the left shows the
conditional intensity function of the stress release model (1), whereas the pictures
on the right represent the hazard functions of the flexible (top) and additive (bottom)
Weibull models respectively. The estimates of these functions have been obtained (i)
by replacing the parameters in their expressions with the parameter estimates (plug-in estimate) and (ii)
through the ergodic mean of the values obtained by replacing each parameter with
the elements of the respective Markov chain generated by the McMC algorithm.
This second way of estimation provides a sequence of values at each instant $t$; e.g.,
for the conditional intensity function (1), one has
$\{\lambda_s(t \mid \alpha^{(r)}, \beta^{(r)}, \rho^{(r)})\}_{r=1}^{R}$, where $R$ is
the number of elements of the Markov chain $(\alpha^{(r)}, \beta^{(r)}, \rho^{(r)})_{r=1}^{R}$. Through this
sequence we can also obtain the median and the quartiles of the pointwise estimate of $\lambda_s(t)$.
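
The two ways of estimating a function of the parameters can be contrasted on a tiny example. The four parameter draws below are invented stand-ins for a Markov chain, and the intensity is assumed to have the exponential stress release form; the snippet only shows how the plug-in value, the ergodic mean and the pointwise quartiles would be computed at one instant.

```python
import math
import statistics


def intensity(t, alpha, beta, rho, released):
    """Assumed stress release intensity exp{alpha + beta * (rho * t - released)}."""
    return math.exp(alpha + beta * (rho * t - released))


# Hypothetical draws (alpha, beta, rho) standing in for the output of an McMC run
chain = [(-2.1, 0.28, 0.50), (-1.9, 0.31, 0.52), (-2.0, 0.30, 0.49), (-2.2, 0.33, 0.51)]
t, released = 20.0, 4.0                      # evaluation time and made-up released stress

# (ii) ergodic mean: average the function over the draws
values = [intensity(t, a, b, r, released) for a, b, r in chain]
ergodic_mean = statistics.fmean(values)
quartiles = statistics.quantiles(values, n=4)   # pointwise quartiles (median is the middle one)

# (i) plug-in: evaluate the function at the posterior means of the parameters
post_means = [statistics.fmean(col) for col in zip(*chain)]
plug_in = intensity(t, post_means[0], post_means[1], post_means[2], released)
```

Because the intensity is a nonlinear function of the parameters, the two estimates generally differ.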

Different approaches can be adopted to evaluate the goodness of fit of a model
to the data and to compare pairs of models; following the Bayesian approach,
we quantify the evidence in favour of a model through the Bayes factor. Given two
models $\mathcal{M}_1$, $\mathcal{M}_2$ and the dataset $D$, the Bayes factor is the ratio of the posterior odds
of the models to their prior odds. When the prior probabilities $pr(\mathcal{M}_k)$, $k = 1, 2$, of
the two models are equal, the Bayes factor coincides with the ratio of their marginal
(or integrated) likelihoods $pr(D \mid \mathcal{M}_k)$, $k = 1, 2$, obtained by integrating (9) over the
parameter space with respect to their prior distributions.

The two versions of our model differ in the probability distribution of the
subordinate rupture times: in the case of the flexible Weibull distribution (model $\mathcal{M}_f$) the
marginal likelihood is equal to -43.39 in $\log_{10}$-scale, whereas in the case of the
additive Weibull distribution (model $\mathcal{M}_a$) the marginal likelihood is equal to -43.06.
According to the interpretation of Jeffreys' scale, the Bayes factor
$\log_{10} B_{a,f} = \log_{10} pr(D \mid \mathcal{M}_a) - \log_{10} pr(D \mid \mathcal{M}_f) = -43.06 + 43.39 = 0.33$ indicates slight
evidence in favor of the model $\mathcal{M}_a$, which has a bathtub-shaped hazard function for
the rupture time.
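
The arithmetic behind this comparison is short enough to spell out. The snippet uses the two marginal likelihood values quoted in the text; the variable names are ours.

```python
# Marginal likelihoods in log10 scale, as reported in the text
log10_ml_flexible = -43.39
log10_ml_additive = -43.06

# log10 Bayes factor of the additive model against the flexible one
log10_bf = log10_ml_additive - log10_ml_flexible
# Back on the natural scale: the data are about twice as likely under the additive model
bayes_factor = 10 ** log10_bf
```

A Bayes factor of roughly 2 on the natural scale falls in the weakest category of Jeffreys' scale ($\log_{10} B$ below 0.5).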

We note that, according to the compilers, aftershocks are lacking in the CPTI04
catalogue; we expect the results to change, and possibly improve, when the model is
applied to catalogues with a greater number of secondary events.

Fig. 2 (Left) Different estimates of the conditional intensity function $\lambda_s(t)$ of the stress release
model, and (Right) of the hazard function of the flexible Weibull distribution (top) and of the
additive Weibull distribution (bottom).

Functional principal component analysis of
quantile curves


Analisi in componenti principali funzionali di curve 
quantiliche 


M. Ruggieri, F. Di Salvo and A. Plaia 


Abstract The literature on functional data analysis is mainly focused on the estimation of
individual curves and the characterization of average dynamics. The idea underlying
this proposal is to focus attention on other particular features of the distribution of 
the observed data, moving from mean functions towards functional quantiles. The 
motivating examples are functional data sets that are collections of high frequency 
data recorded along time. As quantiles provide information on various aspects of a 
time series, we propose a modelling framework for the joint estimation of functional 
quantiles, varying along time, and functional principal components, summarizing 
some common dynamics shared by the functional quantiles. 

Abstract La letteratura sull’analisi di dati funzionali é prevalentemente rivolta 
alla modellazione e stima delle singole curve aleatorie e alla caratterizzazione del 
momento primo. L’idea di base di questo lavoro é considerare altri aspetti della 
distribuzione dei dati osservati, spostando l’attenzione verso i quantili funzion- 
ali. La tipologia di dati a cui questa analisi si rivolge é rappresentata da dati ad 
alta frequenza osservati nel tempo. Poiché i quantili sintetizzano informazioni sulle 
dinamiche temporali di una serie storica, si propone un approccio per la stima 
di quantili funzionali, in corrispondenza di diversi valori di probabilità, e per la 
derivazione di componenti principali funzionali che ne riassumano le dinamiche 
comuni. 


Key words: functional data, nonparametric quantile regression, penalized splines,
functional principal components

1 Introduction


Let us consider high-dimensional data observed at discrete times; although we observe
a finite number of measured values, they are often analyzed as if they were defined
in continuous time.

Traditional analysis concerns the conditional distribution at each time point,
while in a functional data analysis (FDA) approach each time series is considered
as a sample generated by a random curve, varying over a continuum; in both cases,
the goal is most often the centre of the conditional distribution or a mean function
describing the pattern of the set of functions.

In the univariate regression setting, quantile regression models the quantiles of
the conditional distribution of the response variable; this is a valuable alternative
to the conditional mean when the interest is in the tails of the distribution or in the
presence of model mis-specification (see [4] and [5]). With the increasing demand
for statistical tools for FDA, it is therefore natural to try to extend the definition and
the estimation of quantile functions to infinite-dimensional data. However, the
extension of the quantile function to a multivariate setting is not straightforward, because
quantiles are basically defined by ordering the values of a random variable. Since there
is no natural order for $\mathbb{R}^n$ when $n \ge 2$, there is no obvious extension, and a number
of efforts have been devoted to this problem in recent years.

Our proposal explores the performance of the multi-way functional principal
component analysis (FPCA) when functional quantiles of different orders are simultaneously
considered. Some previous works motivate our idea, in particular
[3] and [1]. An approach to generalized regression quantiles, with their synthesis
by means of a small number of principal components, is proposed in [3] in a FDA
framework.

Fraiman et al. [1] define directional quantiles and extend a projection-based
definition of quantiles to infinite-dimensional Hilbert and Banach spaces; the authors
develop a factor analysis based on principal directions and robust principal
directions. The main results in [1] are based on an intuitive definition of directional
quantiles, indexed by an order $\alpha \in (0,1)$ and a direction $u$ in the unit sphere; the
directional quantiles describe the behavior of the probability distribution in finite- and
infinite-dimensional spaces; principal quantile directions are defined to summarize
their information. Moreover, exploiting the idea of statistical depth, they generalize
the definition of robust principal components for functional data.

In a previous paper [2], we estimate multivariate functional data by penalized
B-splines; a working covariance matrix is also derived on the basis of the coefficients of
the splines, accounting for the main temporal effects; FPCA allows us to project data
variations, observed in a multidimensional space, onto a few dimensions. Thanks to
dimensionality reduction and the Karhunen-Loève decomposition, this method
is also useful for the representation of the random curves in terms of the factor
functions.

In the present paper our main purpose is to investigate data by means of FPCA,
capturing the tail behaviour of the distributions. The FDA approach is proposed for the
simultaneous estimation of the functional regression quantiles; assuming that quantiles
estimated at different values of probability share some common features, they
can be summarized by a small number of functional principal components, identifying
the directions that summarize the most interesting characteristics. The
method is applied to air pollution data from a monitoring network.


2 The Methodology 


The $\alpha$-quantile is defined as the inverse of the cumulative distribution function, given
a real-valued random variable $X$ with distribution $F_X$:

$$Q_X(\alpha) =: Q(F_X, \alpha) = \inf\{x \in \mathbb{R} : F_X(x) \ge \alpha\}.$$

We refer to the situation in which the interest is in the $\alpha$-th theoretical quantile
of the conditional distribution of $X$ at time $t$:

$$Q_X(\alpha \mid t) = Q(F_{X \mid t}, \alpha). \qquad (1)$$

In this setting, (1) is a time-varying function:

$$Q_X(\alpha \mid t) = l_\alpha(t), \qquad (2)$$

and the estimator is the minimizer of an expected (generally asymmetric) loss
function.
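
The link between the quantile as a generalized inverse and as the minimizer of an asymmetric loss can be verified on a small sample. The sketch below uses the standard check (pinball) loss, which is one common choice of asymmetric loss and is an assumption here since the text does not name the loss; the data are made up.

```python
import math


def check_loss(u, alpha):
    """Asymmetric absolute loss: alpha * u if u >= 0, (alpha - 1) * u otherwise."""
    return u * (alpha - (1.0 if u < 0 else 0.0))


def empirical_quantile(xs, alpha):
    """inf{x : F_n(x) >= alpha}, with F_n the empirical distribution function."""
    ordered = sorted(xs)
    k = max(math.ceil(alpha * len(ordered)), 1)   # smallest k with k / n >= alpha
    return ordered[k - 1]


def mean_loss(q, xs, alpha):
    """Sample average of the check loss of the residuals x - q."""
    return sum(check_loss(x - q, alpha) for x in xs) / len(xs)


xs = [3.1, 0.4, 2.2, 5.0, 1.7, 4.4, 2.9, 0.9, 3.8, 1.2]   # made-up sample
alpha = 0.25
q_inverse = empirical_quantile(xs, alpha)
q_minimizer = min(xs, key=lambda q: mean_loss(q, xs, alpha))
```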

Kato [4] is one of the earlier papers studying functional quantile regression; starting
with a linear quantile regression, in which the response is scalar while the covariate
is a function, and expanding the covariate and the slope function in terms of their
principal components, the model is transformed into a quantile regression model
with an infinite number of regressors. More recently, in [3] the functional quantiles
$l_{\alpha i}(t)$ are estimated nonparametrically; on the basis of the Karhunen-Loève
decomposition, they may be approximately represented by means of an Empirical
Orthonormal Basis (EOF):

$$l_{\alpha i}(t) = \mu(t) + \sum_{h=1}^{H} \psi_h(i)\, \xi_h(t), \qquad (3)$$

where $\mu(t)$ is a mean function and $\sum_{h=1}^{H} \psi_h(i)\, \xi_h(t)$ is the reduced rank model
obtained by fixing the number $H$ of bases; it is a linear combination of principal
components $\psi_h(i)$ and eigenfunctions $\xi_h(t)$.

The authors perform the analysis by combining the representation (3) with the
estimation procedure of the quantile function, after choosing a proper loss function.

We generalize this approach to the joint estimation of a collection of quantile
functions, defined for a relevant set of probability values, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_Q]$,
implementing a three-mode FPCA analysis together with a general smoothing approach.

A functional form is expressed by means of multidimensional linear smooth functions:

$$l_\alpha(t) = \sum_{k=1}^{K} \theta_k^{\alpha}\, \phi_k(t), \qquad (4)$$

where $\Phi(t) = \{\phi_k(t)\}$ is the set of $K$ basis functions and
$\theta^\alpha = [\theta_1^\alpha, \ldots, \theta_K^\alpha]'$ is the vector of the coefficients.

In order to estimate the $K \times N$ matrix $\Theta$ of parameters, the P-spline (penalized
B-spline) approach is considered here, minimizing the penalized loss function:

$$\mathrm{PENRSS}(\lambda) = w(\alpha)\, |X - \Phi\theta^\alpha|'\, |X - \Phi\theta^\alpha| + \lambda\, \theta^{\alpha\prime} H \theta^\alpha, \qquad (5)$$

where the elements of the vector $w(\alpha) = [w_1(\alpha), \ldots, w_n(\alpha)]'$ are:

• $w_i(\alpha) = \alpha$, if $X_i > \phi_i \theta^\alpha$;

• $w_i(\alpha) = (1 - \alpha)$, if $X_i \le \phi_i \theta^\alpha$.
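
The role of the asymmetric weights can be illustrated in the simplest possible setting: a single constant basis function and no penalty, so that each step of the iteration is just a weighted mean. This is a deliberately reduced sketch and not the P-spline procedure of [2]; the function names and the data are hypothetical. With squared residuals, the fixed point of this iteration is the asymmetric least squares fit (often called an expectile).

```python
def asymmetric_weights(xs, fitted, alpha):
    """Weight alpha for observations above the fit, 1 - alpha for those at or below it."""
    return [alpha if x > f else 1.0 - alpha for x, f in zip(xs, fitted)]


def fit_constant(xs, alpha, n_iter=100, tol=1e-12):
    """Iteratively reweighted least squares for a constant fit (no penalty term)."""
    theta = sum(xs) / len(xs)                      # start from the ordinary mean
    for _ in range(n_iter):
        w = asymmetric_weights(xs, [theta] * len(xs), alpha)
        new_theta = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        if abs(new_theta - theta) < tol:
            break
        theta = new_theta
    return theta


xs = [1.0, 2.0, 3.0, 4.0, 10.0]                    # made-up observations
centre = fit_constant(xs, 0.5)                     # symmetric weights: the mean
upper = fit_constant(xs, 0.9)                      # more weight above the fit
lower = fit_constant(xs, 0.1)                      # more weight below the fit
```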


For details of the penalty term $\lambda\, \theta^{\alpha\prime} H \theta^\alpha$ in (5), as well as for the estimation
procedure, we refer to [2]. In this framework, three-mode functional principal quantiles
are derived straightforwardly by decomposition of the variance function, estimated
by a working variance array (referring to $N$ curves, $T$ time units and $Q$ quantiles)
defined in terms of the estimated coefficients (see also [2]). An interesting result is
the decomposition of a random function into two sets: the set of factor scores, one
for each curve, on the basis of all their quantiles, and the set of corresponding factor
loadings, defining the modes of variation.


3 The application 


We illustrate the proposed method with an example of $PM_{10}$ daily time series recorded
over one year at $N = 59$ stations of a monitoring network in California.

In Fig. 1 (a) the set of the $N$ observed curves is represented (gray lines); a
subset of seven curves is selected (coloured lines) in order to highlight the results
of the procedure. In Fig. 1 (b)-(f) the estimated quantile functions for different
probability values, from 0.1 to 0.9, synthesize the specific pattern of the respective
curves. Fig. 2 shows the projections of the quantile functions in the space of the first
two principal components, with proportions of variance explained 0.793 and 0.073;
panels (a)-(e) are the partial scores for each quantile and (f) the total scores. We
can observe that the functional principal components retain most of the information of
the original curves and that curves with similar patterns have similar scores.

Fig. 1 Observed curves and estimated quantile functions


4 Conclusion 


The FDA approach is proposed for the simultaneous estimation of functional
regression quantiles, when the main purpose is capturing the tail behaviour; assuming
that quantiles estimated at different values of probability share some common
features, they can be summarized by a small number of functional principal
components, identifying the directions that summarize the most interesting
characteristics. Some implications and appealing intuitions can be borrowed from approaches
relying on depth measures, in order to construct basic tools for functional data. The
approach has the advantage of allowing further generalization, such as the inclusion of
explanatory variables and distributional assumptions. Many consequent applications
of FPCA in quantile regression are motivated by the Karhunen-Loève theorem,
by means of which the random curves find convenient representations in terms of
empirical orthogonal functions.

Fig. 2 Projection of the curves in the space of the first two partial (a)-(e) and total (f) principal
components

Detecting group differences in multivariate
categorical data


Ricerca delle differenze tra gruppi in dati categoriali 
multivariati 


Massimiliano Russo 


Abstract In several studies, a group indicator is collected together with a
multivariate vector of categorical variables, with the main goal of assessing
evidence of differences in this vector across the groups. Such goals arise
routinely, but very few general methods for testing group differences in
multivariate categorical data are discussed in the literature. We address this
goal by proposing a Bayesian model which factorizes the joint probability mass
function for the group variable and the multivariate categorical data as the
product of the marginal probabilities for the groups and the conditional
probability mass function of the multivariate categorical data given the group
membership. To provide a flexible and computationally tractable model for the
probability mass function of the multivariate categorical vector, we rely on a
mixture of tensor factorizations, which facilitates dimensionality reduction
while providing simple and accurate test procedures to assess global and local
group differences.

Abstract In molti studi, si osserva un vettore di dati qualitativi non ordinali
associato con un indicatore di gruppo. In questo contesto uno degli obiettivi
principali è quello di stabilire se esistono differenze significative nel vettore
di variabili qualitative osservato al variare del gruppo. Simili obiettivi sono
presenti in diverse applicazioni, ma la letteratura corrente deficita di
metodologie generali che possano testare le differenze di gruppo in un vettore
di dati qualitativi non ordinali a diversi livelli. Per perseguire tale obiettivo
ci si è basati su un modello bayesiano, fattorizzando la probabilità congiunta
per la variabile di gruppo ed il vettore di dati qualitativi come il prodotto
della funzione di probabilità marginale della variabile di gruppo e la funzione
di probabilità condizionata del vettore di variabili qualitative dato
l’indicatore di gruppo. Al fine di ottenere un modello flessibile per la funzione
di probabilità del tensore si è utilizzata una mistura di tensori che favorisce
la riduzione della dimensionalità, portando a procedure accessibili per testare
differenze globali e locali.

Key words: Categorical data; Endgame problem; Hypothesis testing; Tensor factorization.


1 Introduction 


Categorical data arise frequently in applications, especially in areas such as
clinical trials, psychology and the social sciences, where many nominal
variables are collected. In some applications, such as medicine or the social
sciences, the subjects are divided into groups (case/control, severity of a
disease or social classes), and the main goal is to test how the whole set of
measured variables changes according to the group structure. The main aim of
our work is to present a model to test how the dependence structure of a whole
contingency table varies across groups. We additionally propose local tests to
establish which variables are responsible for such variation.

A widely used approach to this goal consists in separately testing for group
differences in each marginal via chi-square tests, accounting for multiplicity
by false discovery rate control [2]. This approach does not incorporate the
dependence structure and hence usually has low power.

To take into account the dependence underlying the data, a possible solution is
given by nonparametric permutation tests [6]. Although they are a valid
alternative, these methods cannot detect changes that go beyond the marginals,
giving inaccurate results when changes occur in higher order structures.

To overcome this last issue, one possibility is to define a test based on a
flexible representation for the probability mass function of the multivariate
categorical data. The analysis of contingency tables is mostly based on
log-linear models [1], but when the number of variables is even moderately high
the set of possible interactions becomes huge, making subsequent inference a
difficult task.

Recently, Dunson and Xing [3] proposed a Bayesian nonparametric methodology
which defines the probability measure over a tensor as a mixture of products of
multinomial distributions, avoiding direct specification of the underlying
dependence structure. The proposed model is computationally tractable, has
theoretical justifications and has recently been generalized to different
frameworks, e.g. [9] and [10].

While these latter contributions focus on modeling the conditional probability
mass function of a univariate response with the categorical data acting as
predictors, we consider instead the dual problem, assessing evidence of group
differences in the entire probability mass function of a multivariate
categorical random variable. To accomplish this goal we factorize the joint
probability mass function for the group variable and the multivariate
categorical random variable as the product of the marginal probabilities for
the groups and the conditional probability mass function of the multivariate
categorical data. The latter is defined via a mixture of tensor factorizations,
allowing a general and tractable formulation which facilitates global testing of
group differences in the entire probability mass function for the multivariate
categorical data.


2 Model formulation and testing


We propose a flexible model for the joint probability mass function
$p_{Y,X} = \{\mathrm{pr}(Y = y, X = x) : y \in \mathcal{Y}, x \in \mathcal{X}\}$
underlying the observed data $(y_i, x_i)$, where
$y_i = (y_{i1}, \ldots, y_{ip})^T \in \mathcal{Y} = (1, \ldots, d_1) \times \cdots \times (1, \ldots, d_p)$
denotes the observed vector of categorical data and
$x_i \in \mathcal{X} = (1, \ldots, k)$ its corresponding group, for each subject
$i = 1, \ldots, n$.

Our main goal is to establish whether the probability mass function varies
across the levels of $X$. This hypothesis can be formally stated as

$H_0 : p_{Y,X}(y, x) = p_Y(y)\, p_X(x)$ versus $H_1 : p_{Y,X}(y, x) \neq p_Y(y)\, p_X(x)$, (1)

with $p_Y(y) = \{\mathrm{pr}(Y = y) : y \in \mathcal{Y}\}$ and
$p_X(x) = \{\mathrm{pr}(X = x) : x \in \mathcal{X}\}$ the marginal probability
mass functions of $Y$ and $X$, respectively.

In order to develop an accurate test procedure for hypothesis (1), avoiding
misspecification issues, a key step is to rely on a representation for
$p_{Y,X}$ which is sufficiently general to approximate any possible probability
mass function in the $(|\mathcal{Y} \times \mathcal{X}| - 1)$-dimensional simplex
$\Delta_{|\mathcal{Y} \times \mathcal{X}| - 1}$. We address this goal by
expressing $p_{Y,X}$ as

$p_{Y,X}(y, x) = p_{Y \mid X = x}(y)\, p_X(x)$, $(y \in \mathcal{Y}, x \in \mathcal{X})$, (2)


with the conditional probability mass function of $Y$ given $X = x$ factorized as

$p_{Y \mid X = x}(y) = \sum_{h=1}^{H} \nu_{hx} \prod_{j=1}^{p} \psi^{(j)}_{h y_j}$, $(y \in \mathcal{Y}, x \in \mathcal{X})$, (3)

where $\nu_x = (\nu_{1x}, \ldots, \nu_{Hx})$, belonging to the probability simplex
$\Delta_{H-1}$, are vectors of mixing probabilities specific to each group
$x = 1, \ldots, k$, while $\psi^{(j)}_{h y_j}$ is the probability that the
categorical random variable $Y_j$ assumes value $y_j$ in mixture component $h$,
for each $y_j \in (1, \ldots, d_j)$, $j = 1, \ldots, p$ and $h = 1, \ldots, H$.
Under factorization (2) the marginal probability $p_X(x)$ is the probability
mass function of a categorical variable with $k$ levels and can be efficiently
modeled via a multinomial distribution.
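
Factorization (3) can be evaluated directly once the mixing weights and the component-specific category probabilities are given. The following Python sketch is only an illustration under stated assumptions, not the implementation used by the author: the function name `conditional_pmf`, the 0-based coding of the categories and all numerical values are hypothetical.

```python
from itertools import product


def conditional_pmf(y, nu_x, psi):
    """Evaluate pr(Y = y | X = x) as a mixture of product kernels, as in (3).

    y    : tuple of p category indices (0-based here, 1-based in the text)
    nu_x : list of H mixing weights for group x
    psi  : psi[h][j][c] is the probability that Y_j takes category c
           in mixture component h
    """
    total = 0.0
    for h, weight in enumerate(nu_x):
        kernel = 1.0
        for j, category in enumerate(y):
            kernel *= psi[h][j][category]
        total += weight * kernel
    return total


# Hypothetical toy values: p = 2 variables with 2 and 3 levels, H = 2, k = 2.
psi = [
    [[0.9, 0.1], [0.2, 0.3, 0.5]],  # component h = 1
    [[0.3, 0.7], [0.6, 0.3, 0.1]],  # component h = 2
]
nu = {1: [0.8, 0.2], 2: [0.1, 0.9]}  # mixing weights for groups x = 1, 2

# Conditional probability of every cell of the 2 x 3 table, for each group.
cells = list(product(range(2), range(3)))
pmf_by_group = {
    x: {y: conditional_pmf(y, nu_x, psi) for y in cells}
    for x, nu_x in nu.items()
}
```

Only the weights differ between the two groups, while the kernels `psi` are shared, which is the borrowing of information across groups that the representation is designed to allow.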

Representation (3) for the conditional probability provides a parsimonious model, 
reducing dimensionality. Group-dependence is included in the model only in the 
mixture weights allowing for efficient borrowing of information. Additionally, it is 
easy to show that hypothesis (1) reduces to 


$H_0 : \nu_1 = \cdots = \nu_k$ versus $H_1 : \nu_x \neq \nu_{x'}$ for some $x, x'$. (4)


Hypothesis (4) is based on a representation of $p_{Y,X}$ which is general and
robust against model misspecification (refer to [7] for proofs and additional
details), providing an accurate solution. Moreover, it can be directly included
in the model via a suitable prior [5]:

$\nu_x = (1 - T)\, u + T\, u_x$,
$u \sim \mathrm{Dirichlet}\{1/H, \ldots, 1/H\}$, $u_x \sim \mathrm{Dirichlet}\{1/H, \ldots, 1/H\}$, $x = 1, \ldots, k$, (5)
$T \sim \mathrm{Ber}\{\mathrm{pr}(H_1)\}$,


where $T$ is a hypothesis indicator, with $T = 0$ for $H_0$ and $T = 1$ for
$H_1$. Under $H_1$ we generate group-specific mixing weights, while under $H_0$
the weight vectors are equal across groups. In assessing evidence in favor of
the alternative, we can rely on the posterior probability
$\mathrm{pr}\{H_1 \mid (y_1, x_1), \ldots, (y_n, x_n)\}$, or on the corresponding
Bayes factor, easily obtained from the output of a Gibbs sampler. Specifically,
under prior (5) the full conditional
$\mathrm{pr}(T = 0 \mid -) = \mathrm{pr}(H_0 \mid -) = 1 - \mathrm{pr}(H_1 \mid -)$
is analytically available.
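
The role of the indicator $T$ in prior (5) can be seen by simulating from it. The sketch below is a minimal illustration and not the sampler described in [7]: the function names are hypothetical, and the symmetric Dirichlet hyperparameters $1/H$ are an assumption made for the example.

```python
import random


def draw_dirichlet(alphas, rng):
    """Dirichlet draw obtained by normalizing independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_mixing_weights(k, H, prob_h1, rng):
    """Draw (T, [nu_1, ..., nu_k]) from a prior with the structure of (5)."""
    alphas = [1.0 / H] * H  # assumed symmetric Dirichlet hyperparameters
    T = 1 if rng.random() < prob_h1 else 0
    u = draw_dirichlet(alphas, rng)  # shared weights, used when T = 0
    nu = []
    for _ in range(k):
        u_x = draw_dirichlet(alphas, rng)  # group-specific weights, used when T = 1
        nu.append([(1 - T) * a + T * b for a, b in zip(u, u_x)])
    return T, nu


rng = random.Random(1)
T_null, nu_null = draw_mixing_weights(k=2, H=3, prob_h1=0.0, rng=rng)
T_alt, nu_alt = draw_mixing_weights(k=2, H=3, prob_h1=1.0, rng=rng)
```

With `prob_h1=0.0` the two groups share one weight vector, which corresponds to the null in (4); with `prob_h1=1.0` each group receives its own vector.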

Although rejection of the global null (4) provides evidence of group
differences in the multivariate categorical random variable $Y$, such changes
may be due to several structures. To provide interpretable inference we
additionally consider local analyses assessing evidence of group differences in
each marginal $Y_j$ of $Y$, for $j = 1, \ldots, p$.

We address this aim by adapting the model-based version of Cramér's V
coefficient [3] to our local tests. Specifically, we measure the association
between each marginal $Y_j$ and $X$, for $j = 1, \ldots, p$, by studying the
posterior distributions of the coefficients

$\rho_j^2 = \frac{1}{\min(k, d_j) - 1} \sum_{x=1}^{k} \sum_{y_j=1}^{d_j} \frac{\{p_{Y_j,X}(y_j, x) - p_{Y_j}(y_j)\, p_X(x)\}^2}{p_{Y_j}(y_j)\, p_X(x)}$, (6)

where $p_{Y_j}(y_j)$ denotes $\mathrm{pr}(Y_j = y_j)$, while
$p_{Y_j,X}(y_j, x) = \mathrm{pr}(Y_j = y_j, X = x) = \mathrm{pr}(Y_j = y_j \mid X = x)\, \mathrm{pr}(X = x) = p_{Y_j \mid X = x}(y_j)\, p_X(x)$
for every $y_j \in (1, \ldots, d_j)$ and group $x = 1, \ldots, k$.

Relying on $\rho_j \in [0, 1]$ to study variation of the marginal across groups
is convenient for interpretation. In fact, according to (6), a value of
$\rho_j$ very close to 0 suggests low dependence between $Y_j$ and $X$. A
discussion of the choice of the prior distributions for the quantities involved
and an efficient algorithm to perform posterior inference under the proposed
model are given in [7].
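
As a worked illustration of coefficient (6), the sketch below computes Cramér's V from a table of joint probabilities for one marginal $Y_j$ and the group variable $X$. It is a minimal example with hypothetical names and inputs; in the model-based setting the same computation would be repeated for each posterior draw of the joint probabilities.

```python
def cramers_v(joint):
    """Cramér's V for a k x d_j table with joint[x][c] = pr(Y_j = c, X = x)."""
    k, d = len(joint), len(joint[0])
    p_x = [sum(row) for row in joint]                                  # pr(X = x)
    p_y = [sum(joint[x][c] for x in range(k)) for c in range(d)]       # pr(Y_j = c)
    total = 0.0
    for x in range(k):
        for c in range(d):
            expected = p_x[x] * p_y[c]  # joint probability under independence
            if expected > 0.0:
                total += (joint[x][c] - expected) ** 2 / expected
    return (total / (min(k, d) - 1)) ** 0.5


# Hypothetical tables: independence and perfect dependence.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
v_independent = cramers_v(independent)
v_dependent = cramers_v(dependent)
```

The two toy tables mark the ends of the scale: the coefficient is 0 when the joint table factorizes and 1 when the group determines the category.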


3 Application to chess endgame data 


Developing efficient strategies for the endgame part of a chess game presents
many difficulties, especially when implementing a computer chess program. These
difficulties arise because the decisions to be taken differ from those used in
the first part of the game, and depend strongly on which pieces are still on the
board and on their positions. Different ending scenarios can be considered, but
we focus on the King-Rook vs. King-Pawn game. The data consist of $n = 3196$
chess games in which $p = 36$ categorical variables describing the chess board
configuration are recorded together with a variable indicating whether White can
win the game or not (refer to [8] for a more detailed description of the data).
The data analysed are publicly available at the UCI machine learning repository
[4]. Our interest is in establishing whether the position of the pieces at the
beginning of the endgame stage has an impact on the final result and, if so, in
identifying the pieces and positions responsible for such variation. This
information might be used to develop refined game strategies, driving the game
towards the final positions most favourable to victory or a draw.

We consider 5000 Gibbs samples relying on the hyperparameter settings suggested
in [7], except for the upper bound $H = 20$. Trace-plots suggest that
convergence is reached after a burn-in of 1000. Results from posterior inference
offer interesting insights on group differences in the endgame problem
considered, with $\mathrm{pr}\{H_1 \mid (y_1, x_1), \ldots, (y_n, x_n)\} > 0.95$
providing evidence that the starting position has a strong impact on the final
result. Figure 1 shows the posterior mean and 0.9 credible interval for the
Cramér's V coefficient (6) for all the variables considered. Only a few
positions of the pieces appear to affect the final result, which suggests
steering the game towards these positions when building sophisticated
strategies.


[Figure: posterior summaries of $\rho_j$ plotted against the variable index $j = 1, \ldots, 36$; vertical axis from 0.00 to 0.20.]

Fig. 1 Posterior mean, 0.1 and 0.9 quantiles of $\rho_j$ for the $j = 1, \ldots, 36$ chess
piece/position variables considered. Dark grey bars are such that
$\mathrm{pr}\{\rho_j > 0.10 \mid (y_1, x_1), \ldots, (y_n, x_n)\} > 0.95$.


A Sequential Test for the Cpk Index
Un test sequenziale per l’indice Cpk


Michele Scagliarini 


Abstract We propose a new sequential hypothesis test for the process capability
index Cpk. We compare the statistical properties of the proposed test with those
of non-sequential tests by means of a simulation study. The results indicate
that the sequential test allows a substantial reduction in sample size, while
the type I and II error probabilities are maintained at their desired values.
Abstract In questo lavoro si propone un nuovo test sequenziale per la verifica
d’ipotesi sull’indice di capacità Cpk. Le proprietà statistiche del test sequenziale
sono confrontate con le proprietà di test non sequenziali mediante simulazioni. I
risultati indicano che il test sequenziale consente una notevole riduzione
dell’ampiezza campionaria mantenendo le probabilità degli errori di primo e
secondo tipo ai valori prefissati.


Key words: Brownian motion, Monte Carlo Simulation, non-central t distribution, 
power function, process capability indices, sequential test. 


1 Introduction 


One of the process capability indices most widely used in industry today is
$C_{pk} = \{d - |\mu - \tfrac{1}{2}(USL + LSL)|\}/(3\sigma) = (d - |\mu - m|)/(3\sigma)$,
where $\mu$ is the process mean, $\sigma$ is the process standard deviation,
$LSL$ and $USL$ are the specification limits, $d = (USL - LSL)/2$ and
$m = (USL + LSL)/2$ [4]. Often, as part of a contractual agreement, it is
necessary to demonstrate that the process capability index $C_{pk}$ meets or
exceeds some particular target value, say $c_{pk,0}$. Such a decision-making
problem may be formulated as a hypothesis testing problem:
$H_0 : C_{pk} \le c_{pk,0}$, i.e. the process is not capable, versus
$H_1 : C_{pk} > c_{pk,0}$, i.e. the process is capable.
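
The definition of the index can be checked on a small numerical example. The following sketch is only illustrative: the function name `cpk` and the numerical values are hypothetical, and the process mean and standard deviation are treated as known.

```python
def cpk(mu, sigma, lsl, usl):
    """Process capability index (d - |mu - m|) / (3 * sigma)."""
    d = (usl - lsl) / 2.0  # half-width of the specification interval
    m = (usl + lsl) / 2.0  # midpoint of the specification interval
    return (d - abs(mu - m)) / (3.0 * sigma)


# Hypothetical process with specification limits 4 and 16.
centred = cpk(mu=10.0, sigma=1.0, lsl=4.0, usl=16.0)
shifted = cpk(mu=12.0, sigma=1.0, lsl=4.0, usl=16.0)
```

A shift of the mean away from the midpoint lowers the index, so with a target value of 1.33 the second process sits right at the boundary between the two hypotheses.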

In this study, starting from some of the results obtained by [2], we propose a
sequential test for the index $C_{pk}$. Firstly, we review two of the most
widely used non-sequential tests for assessing whether a process is capable or
not. Secondly, we analytically derive the test statistic of the sequential test.
Thirdly, we describe the testing procedure in detail. Finally, we compare the
properties of the sequential test with the performance of the non-sequential
tests by means of an extensive simulation study. The results show that the
proposed sequential test has good power behavior and allows a substantial
reduction in sample size, which translates into reduced costs, time and
resources.


2 Hypothesis testing on Cpk 


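Assuming a normally distributed quality characteristic, $X \sim N(\mu, \sigma^2)$,
[5] proposed a statistical test (PC-test) based on the distribution of the
estimator $\tilde{C}_{pk} = b_{n-1} \hat{C}_{pk}$, where
$\hat{C}_{pk} = \{d - (\bar{X} - m)\, I_A(\mu)\}/(3S)$,
$b_{n-1} = \sqrt{2}\, \Gamma[(n-1)/2] / \{\sqrt{n-1}\, \Gamma[(n-2)/2]\}$, $n$ is the
sample size, $\bar{X} = \sum_{i=1}^{n} X_i / n$,
$S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$, $I_A(\mu) = 1$ if $\mu \in A$ and
$I_A(\mu) = -1$ if $\mu \notin A$, with $A = \{\mu \mid \mu \ge m\}$. Given the type I
error probability $\alpha$, the critical value of the test is
$c_0 = b_{n-1}\, t_{n-1,\alpha}(\delta_c)/(3\sqrt{n})$, where $t_{n-1,\alpha}(\delta_c)$ is
the upper $\alpha$ quantile of a non-central $t$ distribution with $n-1$ degrees of
freedom and non-centrality parameter $\delta_c = 3\sqrt{n}\, c_{pk,0}$. The power
function of the PC-test can be computed as
$\pi_{PC}(c_{pk,1}) = \Pr\{t_{n-1}(\delta) > 3\sqrt{n}\, c_0 / b_{n-1}\}$, where
$\delta = 3\sqrt{n}\, c_{pk,1}$.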
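
The correction factor and the corrected estimator used by the PC-test can be computed with the standard library alone. The sketch below is a hedged illustration rather than the procedure of [5]: the function names are hypothetical, and the indicator of whether the process mean lies at or above the midpoint $m$ is assumed to be known and is passed in as a flag. The critical value is not computed here, since it requires quantiles of the non-central $t$ distribution.

```python
import math
import statistics


def bias_correction(n):
    """sqrt(2 / (n - 1)) * Gamma((n - 1) / 2) / Gamma((n - 2) / 2), for n > 2."""
    log_ratio = math.lgamma((n - 1) / 2.0) - math.lgamma((n - 2) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def pc_estimator(sample, lsl, usl, mean_at_or_above_midpoint):
    """Corrected estimator: correction factor times {d - (x_bar - m) * I} / (3 S)."""
    n = len(sample)
    d = (usl - lsl) / 2.0
    m = (usl + lsl) / 2.0
    indicator = 1.0 if mean_at_or_above_midpoint else -1.0  # assumed known
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, divisor n - 1
    return bias_correction(n) * (d - (x_bar - m) * indicator) / (3.0 * s)


# Hypothetical sample of size 3 with specification limits 4 and 16.
estimate = pc_estimator([9.0, 10.0, 11.0], 4.0, 16.0, mean_at_or_above_midpoint=True)
```

Working on the log scale with `math.lgamma` avoids overflow of the gamma function for large sample sizes.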
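Recently, [3] discussed a test (LP-test) for $H_0$ versus $H_1$ based on the
estimator $\hat{C}_{pk} = (1 - |\hat{\delta}|)/(3\hat{\gamma})$, where
$\hat{\gamma} = \hat{\sigma}/d$, $\hat{\delta} = (\bar{X} - m)/d$ and
$\hat{\sigma}^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / n$. Under the assumption of a
normally distributed quality characteristic, the authors obtained the critical
value of the test as $c_{pk,\alpha} = t_{n-1,\alpha}(\delta_c)/(3\sqrt{n-1})$, where
$\delta_c = 3\sqrt{n}\, c_{pk,0}$. The power function of the LP-test can be computed
as $\pi_{LP}(c_{pk,1}) = 1 - F_{\hat{C}_{pk}}(c_{pk,\alpha})$, where
$F_{\hat{C}_{pk}}$ is the cumulative distribution function of $\hat{C}_{pk}$,
defined as
$F_{\hat{C}_{pk}}(x) = 1 - \Pr\{t_{n-1}(\delta_1) < -t\} + \Pr\{t_{n-1}(\delta_2) < t\}$ if $x < 0$
and
$F_{\hat{C}_{pk}}(x) = 1 - Q_{n-1}(-t, \delta_1; 0, R) + Q_{n-1}(t, \delta_2; 0, R)$ if $x > 0$.
In the previous equations $t_{n-1}(\delta_1)$ and $t_{n-1}(\delta_2)$ are non-central
$t$ variables with $n-1$ degrees of freedom and non-centrality parameters
$\delta_1 = -3\sqrt{n}\,(1 - \delta)\, C_{pk}/(1 - |\delta|)$ and
$\delta_2 = 3\sqrt{n}\,(1 + \delta)\, C_{pk}/(1 - |\delta|)$ respectively,
$\delta = (\mu - m)/d$, $t = 3\sqrt{n-1}\, x$, $R = \sqrt{n-1}\,(\delta_2 - \delta_1)/(2t)$,
$\Gamma$ is the gamma function,

$Q_f(t, \delta; 0, R) = \frac{\sqrt{2\pi}}{\Gamma(f/2)\, 2^{(f-2)/2}} \int_0^R \Phi\!\left(\frac{t x}{\sqrt{f}} - \delta\right) x^{f-1} \phi(x)\, dx$,

and $\Phi$ and $\phi$ are respectively the standard normal cumulative distribution
function and probability density function.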


3 A sequential test for Cpk


Under the assumption that the data come from a multivariate distribution with
density function $f(x; \theta)$, [2] proposed a general sequential procedure for
testing $H_0 : h(\theta) = 0$ versus $H_1 : h(\theta) \neq 0$, where
$h(\theta) : \mathbb{R}^d \to \mathbb{R}^q$, with $q \le d$, is a function with first
order derivative matrix denoted by $H(\theta)$, with $\theta$ unknown. Under the
standard regularity conditions for the existence of the multivariate maximum
likelihood estimators, the author showed that the statistic

$W_k = k\, h^T(\tilde{\theta}_k) [H^T(\theta)\, I^{-1}(\theta)\, H(\theta)]^{-1} h(\tilde{\theta}_k)$,

where $k$ is the sample size, $\tilde{\theta}_k$ is a consistent estimator of
$\theta$ and $I(\theta)$ is the Fisher information matrix, can be approximated by
a functional of Brownian motions. Thus the author [2] proposed as test statistic

$W_k^* = k\, h^T(\hat{\theta}_k) [H^T(\hat{\theta}_k)\, I^{-1}(\hat{\theta}_k)\, H(\hat{\theta}_k)]^{-1} h(\hat{\theta}_k)$,

where $\hat{\theta}_k$ is the maximum likelihood estimator of $\theta$. The
$\alpha$-level sequential test procedure, truncated at the maximal allowable
sample size $n_0$, is performed as follows:

1. for $k = 2, 3, \ldots, n_0$, compute the statistic $W_k^{(n_0)} = \sqrt{k/n_0}\, W_k^*$;
2. hypothesis $H_0$ is rejected the first time that $W_k^{(n_0)}$ exceeds the critical value $w_\alpha$;
3. if $W_k^{(n_0)}$ does not exceed $w_\alpha$ by $n_0$, then do not reject $H_0$.

The maximal sample size $n_0$ can be decided on the basis of financial, ethical
or statistical reasons, for example to achieve a desired power level. The
critical value $w_\alpha$, given the type I error probability $\alpha$, can be
obtained from [1].


Let us consider the hypothesis $H_0 : C_{pk} = c_{pk,0}$ versus
$H_1 : C_{pk} \neq c_{pk,0}$ and assume that the quality characteristic is normally
distributed: $X \sim N(\mu, \sigma^2)$. Let us define $h(\theta)$ as
$h(\theta) = \ln\{(C_{pk})^2\} - \ln\{(c_{pk,0})^2\} = \ln[(d - |\mu - m|)^2 / \{9\sigma^2 (c_{pk,0})^2\}]$,
where $\theta = (\mu, \sigma^2)$. For $C_{pk} \ge 0$, $H_0$ is equivalent to
$H_0 : h(\theta) = 0$ and the alternative hypothesis is equivalent to
$H_1 : h(\theta) \neq 0$. In the case at hand the statistic $W_k^*$ can be written as
$W_k^* = k\, h^2(\hat{\theta}_k) [H^T(\hat{\theta}_k)\, I^{-1}(\hat{\theta}_k)\, H(\hat{\theta}_k)]^{-1}$,
where $\hat{\theta}_k = (\bar{X}_k, S_k^2)$ with $\bar{X}_k = \sum_{i=1}^{k} x_i / k$ and
$S_k^2 = \sum_{i=1}^{k} (x_i - \bar{X}_k)^2 / k$. The function $h(\hat{\theta}_k)$ is
therefore given by
$h(\hat{\theta}_k) = \ln[(d - |\bar{X}_k - m|)^2 / \{9 S_k^2 (c_{pk,0})^2\}]$ and
consequently $W_k^*$ can be written as

$W_k^* = k \left\{ \ln\left[ \frac{(d - |\bar{X}_k - m|)^2}{9 S_k^2 (c_{pk,0})^2} \right] \right\}^2 \left[ \frac{4\, (\mathrm{signum}[\bar{X}_k - m])^2\, S_k^2}{(d - |\bar{X}_k - m|)^2} + 2 \right]^{-1}$,

where $\mathrm{signum}[a] = a/|a|$ if $a \neq 0$ and $\mathrm{signum}[a] = 0$ if $a = 0$.
Therefore, given the value of $\alpha$ and the maximal allowable sample size $n_0$,
the test is performed by computing, for $k = 2, 3, \ldots, n_0$, the statistic
$W_k^{(n_0)} = \sqrt{k/n_0}\, W_k^*$.

Let $n_{stop}$ be the first integer $k = 2, 3, \ldots, n_0$ for which
$W_k^{(n_0)} > w_\alpha$: we reject $H_0$ if $W_{n_{stop}}^{(n_0)} > w_\alpha$; we do
not reject $H_0$ if $W_k^{(n_0)}$ does not exceed $w_\alpha$ by $n_0$. In this
framework $n_{stop}$ is the stopping sample size of the test.

4 Simulation study and concluding remarks 


We study the properties of the sequential procedure by comparing its
performance with that of the LP and PC-tests. More precisely, we compare the
tests in terms of the sample size required to achieve a given power. Note that
the sequential test is two-sided with composite alternative hypothesis
$H_1 : C_{pk} \neq c_{pk,0}$, while the LP and PC-tests are unilateral. In order to
compare the statistical properties of the tests correctly, we considered cases
under $H_1$ where $C_{pk} = c_{pk,1}$ with $c_{pk,1} > c_{pk,0}$. In this manner the
bilateral sequential test with type I error probability $\alpha$ can be compared
with the unilateral non-sequential tests with type I error probability equal to
$\alpha_u = \alpha/2$. To study the properties of the sequential procedure we
examined several scenarios where different values of $C_{pk}$ under $H_1$
($c_{pk,1}$) were considered for the unilateral tests with $\alpha_u = 0.01, 0.05$
and $c_{pk,0} = 1.33, 1.67$. For the LP and PC-tests we analytically determined
the minimum sample sizes, $n_{LP;0.80}$ and $n_{PC;0.80}$, required to achieve a power
of at least 0.80, i.e. $\pi_{LP}(c_{pk,1}) \ge 0.80$ and $\pi_{PC}(c_{pk,1}) \ge 0.80$.
For the sequential test we used a set of simulation studies. For each value of
$\alpha$, $c_{pk,0}$ and $c_{pk,1}$ we generated $10^4$ replicates from a normally
distributed quality characteristic. The aim of these simulations was to
determine the smallest maximal allowable sample size, $n_{0;\pi_s>0.80}$, which
gives an empirical power $\pi_s$ greater than 0.80, i.e. $\pi_s > 0.80$. The
empirical power $\pi_s$ of the sequential test is estimated as the proportion of
correctly rejected $H_0$. Note that, in order to obtain $n_{0;\pi_s>0.80}$, we
implemented an iterative search algorithm which determines $n_{0;\pi_s>0.80}$ with
suitable precision.


The simulation results are summarized in Table 1 where, for each combination of
$\alpha$, $c_{pk,0}$ and $c_{pk,1}$, the following quantities are reported:
$n_{0;\pi_s>0.80}$, the smallest maximal allowable sample size for the sequential
test for achieving an empirical power $\pi_s > 0.80$; $n_{avg}$, the average of the
stopping sample sizes $n_{stop}$ required by the sequential test with maximal
allowable sample size $n_{0;\pi_s>0.80}$ for concluding in favor of $H_1$;
S.D.($n_{stop}$), the standard deviation of the final sample sizes $n_{stop}$;
$\hat{\pi}_s$, the estimated power of the sequential test. Table 1 also contains
$n_{LP;0.80}$, the minimum sample size required by the LP-test for achieving a
power level $\ge 0.80$, and $n_{PC;0.80}$, the minimum sample size required by the
PC-test for achieving a power level $\ge 0.80$.


Table 1: Simulation results under $H_1$ with $C_{pk} = c_{pk,1}$ for $c_{pk,0} = 1.33, 1.67$ and $\alpha = 0.02, 0.1$

c_pk,1   n_LP;0.80   n_PC;0.80   n_0;pi_s>0.80   n_avg   S.D.(n_stop)   est. power
case c_pk,0 = 1.33 and alpha = 0.02
1.60     199         178         171             116.1   29.8           0.811
1.70     115         104          96              64.7   17.2           0.815
1.80      75          68          68              41.7   11.2           0.809
case c_pk,0 = 1.33 and alpha = 0.1
1.60     124         108         107              65.8   21.5           0.820
1.70      70          62          60              36.0   12.7           0.817
1.80      47          41          39              23.0    8.8           0.819
case c_pk,0 = 1.67 and alpha = 0.02
2.00     198         181         173             117.2   29.9           0.816
2.10     125         115         106              71.6   18.3           0.812
2.20      88          82          74              49.7   13.4           0.819
case c_pk,0 = 1.67 and alpha = 0.1
2.00     123         109         106              65.2   21.4           0.809
2.10      78          69          65              39.4   13.6           0.800
2.20      53          48          45              27.0   10.0           0.813


By examining the averages of the final sample sizes $n_{avg}$, the results show
that the sequential test, with the same power as the LP and PC-tests, requires
considerably fewer observations. Furthermore, the maximum allowable sample size
$n_{0;\pi_s>0.80}$ required to achieve the desired power never exceeds
$n_{LP;0.80}$ and $n_{PC;0.80}$. This indicates that even in the worst cases the
sequential test needs a maximum allowable sample size not greater than the
sample size of the non-sequential tests. As an example, under
$H_1 : C_{pk} = c_{pk,1}$ with $c_{pk,1} = 1.60$, when $c_{pk,0} = 1.33$ and
$\alpha = 0.02$, we have $n_{LP;0.80} = 199$ and $n_{PC;0.80} = 178$, while with a
maximum allowable sample size equal to $n_{0;\pi_s>0.80} = 171$ the power of the
sequential test is $\pi_s > 0.80$ with $n_{avg} = 116.1$ ($\approx 117$). In this
case the sequential procedure saves, on average, 41.2% of the sample size with
respect to the LP-test and 34.3% with respect to the PC-test. Under $H_1$ with
$c_{pk,1} = 2.0$, when $c_{pk,0} = 1.67$ and $\alpha = 0.1$, we have
$n_{LP;0.80} = 123$ and $n_{PC;0.80} = 109$, while with a maximum allowable sample
size equal to $n_{0;\pi_s>0.80} = 106$ the power of the sequential test is
$\pi_s > 0.80$ with $n_{avg} = 65.2$ ($\approx 66$). In this case the sequential
procedure saves, on average, 46.3% of the sample size with respect to the
LP-test and 39.4% with respect to the PC-test. A further simulation study,
conducted under $H_0$, confirmed that the empirical type I error probability of
the sequential test does not exceed the nominal $\alpha$-level.
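
The percentage savings quoted in the two examples can be reproduced from the entries of Table 1. The short sketch below assumes, as the rounded values in the text suggest, that the average stopping sample size is rounded up to the next integer before the comparison; the function name is hypothetical.

```python
import math


def percent_saving(n_fixed, n_avg):
    """Percentage of the fixed sample size saved, with n_avg rounded up."""
    return 100.0 * (n_fixed - math.ceil(n_avg)) / n_fixed


# First example: c_pk,0 = 1.33, alpha = 0.02, c_pk,1 = 1.60.
saving_lp_1 = round(percent_saving(199, 116.1), 1)
saving_pc_1 = round(percent_saving(178, 116.1), 1)
# Second example: c_pk,0 = 1.67, alpha = 0.1, c_pk,1 = 2.00.
saving_lp_2 = round(percent_saving(123, 65.2), 1)
saving_pc_2 = round(percent_saving(109, 65.2), 1)
```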

The results show that the sequential test leads, on average, to smaller
stopping sample sizes than the fixed sample size tests while maintaining the
desired $\alpha$-level and power. Furthermore, the maximum allowable sample sizes
required by the sequential test to achieve the desired power level are less
than, or at most equal to, the sample sizes required by the non-sequential
tests: this means that, even in the worst cases, the sequential procedure uses a
sample size that does not exceed the sample size of the non-sequential tests
with the same power level. In summary, the proposed sequential procedure has
several interesting features: it offers a substantial decrease in sample size
compared with the non-sequential tests, while type I and II error probabilities
are correctly maintained at their desired values.


Industrial Applications of Bayesian Structural
Time Series


Applicazioni industriali delle serie storiche strutturali 
bayesiane 


Steven L. Scott 


Abstract Not every business problem involves time series, but every business has 
time series problems of one sort or another. Bayesian structural time series models 
are a flexible and powerful tool for modeling time series data. The models are addi- 
tive, allowing the analyst to combine latent state components for handling trend, sea- 
sonal, regression, and other structural features. Additivity also makes it easy to place 
informative priors on individual components, like a sparsity-inducing spike and slab 
prior on a regression component when working with large numbers of contempora- 
neous predictors. These methods are encoded in the bsts R package [Scott(2011)], 
which was developed at Google to provide Bayesian time series modeling capabil- 
ities to non-experts in Bayesian modeling. The package has been used for a variety 
of purposes, including nowcasting economic time series, anomaly detection, fore- 
casting, and causal inference. 

Abstract Non tutti i problemi aziendali hanno a che fare con le serie storiche 
ma ogni azienda ha qualche tipo di problema legato alle serie storiche. I modelli 
bayesiani strutturali sono uno strumento flessibile e potente per l’analisi di questo 
tipo di dati. Questi modelli sono additivi e permettono all’analista di combinare 
componenti allo stato latente per modellare stagionalità, trend, regressione e
altre caratteristiche strutturali. L’additività rende anche più facile specificare
distribuzioni a priori informative sui singoli componenti, come una distribuzione
spike-and-slab per indurre sparsità sui coefficienti di regressione quando si lavora con
un gran numero di predittori. Questi metodi sono implementati nella libreria R 
bsts [Scott(2011)] sviluppata a Google per fornire uno strumento per l’analisi 
Bayesiana delle serie storiche a utenti non esperti di modellazione bayesiana. La 
libreria è stata usata per una serie di scopi tra cui il nowcasting di serie storiche 
economiche, l’anomaly detection, la previsione e l’inferenza causale. 


Key words: time series models, Kalman filter, spike and slab prior, variable selection, big data


1 Structural Time Series Models


A structural time series model is any model that can be written in the form

$y_t = Z_t^T \alpha_t + \epsilon_t$
$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t$, (1)

where $\epsilon_t \sim N(0, H_t)$, $\eta_t \sim N(0, Q_t)$, and $\eta_t$ and
$\epsilon_t$ are independent of all other quantities. The value $y_t$ is the only
observed data in equation (1). The vector of latent variables $\alpha_t$ is known
as the state of the model, and is used to encode the underlying time trend,
seasonal pattern, and other desired structure. The remaining symbols in equation
(1) are structural parameters. The values of $T_t$, $Z_t$, and $R_t$ may contain
statistical parameters, but more often they consist of appropriately placed 0's
and 1's. Equation (1) allows the modeler substantial flexibility in defining the
state, by concatenating elements from commonly used state models as appropriate
for the application at hand.

For example, a common trend model, known as the local linear trend, can be
written

$y_t = \mu_t + \epsilon_t$
$\mu_{t+1} = \mu_t + \delta_t + \eta_{1t}$ (2)
$\delta_{t+1} = \delta_t + \eta_{2t}$.

The state for the local linear trend model is $\alpha_t = (\mu_t, \delta_t)$, the
transition matrix is

$T_t = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$,

$R_t = I_2$, the $2 \times 2$ identity matrix, and $Z_t^T = (1, 0)$. This model
captures the current level of the trend in $\mu_t$, and the slope of the trend
(the extra amount of $\mu$ as $t$ increases by 1) in $\delta_t$. The trend is a
random walk, with a drift term that is also a random walk.
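
To make the state space form of the local linear trend concrete, the following Python sketch simulates model (2) using the matrices given above. It is an illustration only and is unrelated to the bsts implementation; the function name, the argument names and the initial values are hypothetical.

```python
import random


def simulate_local_linear_trend(n, sigma_obs, sigma_level, sigma_slope,
                                mu0=0.0, delta0=0.0, seed=0):
    """Simulate n observations from the local linear trend model (2)."""
    rng = random.Random(seed)
    T = [[1.0, 1.0], [0.0, 1.0]]  # transition matrix
    Z = [1.0, 0.0]                # observation vector
    state = [mu0, delta0]         # (level, slope)
    ys, states = [], []
    for _ in range(n):
        states.append(tuple(state))
        ys.append(Z[0] * state[0] + Z[1] * state[1] + rng.gauss(0.0, sigma_obs))
        # R_t is the identity, so the state errors are added directly.
        eta = [rng.gauss(0.0, sigma_level), rng.gauss(0.0, sigma_slope)]
        state = [T[i][0] * state[0] + T[i][1] * state[1] + eta[i] for i in range(2)]
    return ys, states


# With all variances set to zero the model reduces to a straight line.
line, _ = simulate_local_linear_trend(4, 0.0, 0.0, 0.0, mu0=1.0, delta0=0.5)
noisy, noisy_states = simulate_local_linear_trend(100, 1.0, 0.1, 0.01)
```

Setting the three standard deviations to zero is a convenient sanity check: the level then increases by exactly the slope at each step.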

The most commonly used seasonal model (assuming there are $S$ seasons per
cycle) is

$y_t = \gamma_t + \epsilon_t$
$\gamma_{t+1} = -\sum_{s=1}^{S-1} \gamma_{t+1-s} + \eta_t$. (3)

Think of the seasonal model as a regression model where the coefficients of $S$
seasonal dummy variables evolve over time. Rather than leave out one dummy
variable, the regression enforces the constraint that the coefficients must sum
to zero (in expectation) over the course of a full cycle. Thus the expected
value of any one coefficient is the negative sum of the remaining coefficients.

The state for model (3) is $\alpha_t = (\gamma_t, \gamma_{t-1}, \ldots, \gamma_{t-S+2})$,
which stores the most recent $S - 1$ seasonal coefficients. The transition matrix
is

$T_t = \begin{pmatrix} -\mathbf{1}_{S-1}^T \\ I_{S-2} \;\; \mathbf{0}_{S-2} \end{pmatrix}$, (4)

where $\mathbf{1}_{S-1}$ is a vector of $S - 1$ ones, $I_{S-2}$ is the identity
matrix of order $S - 2$ and $\mathbf{0}_{S-2}$ is a column of $S - 2$ zeros. This
matrix computes the mean of $\gamma_{t+1}$ and then shifts all current elements of
the state by one, so that the last one falls off the end. Notice that this model
is not full rank. The state is $S - 1$ dimensional, but only a one-dimensional
error term is needed. That is the reason for the matrix $R_t$, which in this case
is a column vector with a 1 in the first position and 0's everywhere else. It
expands the one-dimensional error term $\eta_t$ into an $S - 1$ dimensional vector
that can be added to the state.
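
The structure of the seasonal transition matrix is easiest to see for a small number of seasons. The sketch below builds the matrix and applies one transition step; it is a hypothetical illustration (function names and values are not taken from the bsts package), written with plain lists so that it has no dependencies.

```python
def seasonal_transition(S):
    """(S-1) x (S-1) transition matrix: a row of -1's above a shifted identity."""
    dim = S - 1
    T = [[-1.0] * dim]  # first row: minus the sum of the stored coefficients
    for i in range(dim - 1):
        # row i + 1 copies element i of the old state (shift by one position)
        T.append([1.0 if j == i else 0.0 for j in range(dim)])
    return T


def seasonal_step(state, eta, S):
    """One transition: new state = T * state + R * eta, with R = (1, 0, ..., 0)."""
    T = seasonal_transition(S)
    R = [1.0] + [0.0] * (S - 2)
    return [
        sum(T[i][j] * state[j] for j in range(S - 1)) + R[i] * eta
        for i in range(S - 1)
    ]


# Hypothetical state for S = 4 seasons: the three most recent coefficients.
state = [1.0, 2.0, 3.0]
next_state = seasonal_step(state, eta=0.0, S=4)
```

With the error term set to zero, the new coefficient together with the three stored ones sums to zero, which is the constraint over a full cycle described above.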

A powerful feature of structural time series models is that state components
can be combined modularly, by concatenating state vectors, concatenating the
corresponding $Z_t$ vectors, and combining the structural matrices $T_t$, $R_t$,
and $Q_t$ block-diagonally.
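
The modular combination of state components can be sketched as follows for a local linear trend plus a seasonal component with four seasons. The helper `block_diag` and all variable names are hypothetical, and the seasonal observation vector $(1, 0, 0)$ is an assumption that follows from the observation equation of the seasonal model, in which only the current seasonal coefficient enters.

```python
def block_diag(a, b):
    """Block-diagonal combination of two matrices stored as lists of rows."""
    cols_a, cols_b = len(a[0]), len(b[0])
    top = [row + [0.0] * cols_b for row in a]
    bottom = [[0.0] * cols_a + row for row in b]
    return top + bottom


# Local linear trend component.
T_trend = [[1.0, 1.0], [0.0, 1.0]]
R_trend = [[1.0, 0.0], [0.0, 1.0]]
Z_trend = [1.0, 0.0]

# Seasonal component with S = 4 seasons.
T_seasonal = [[-1.0, -1.0, -1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
R_seasonal = [[1.0], [0.0], [0.0]]
Z_seasonal = [1.0, 0.0, 0.0]  # assumed: only the current coefficient is observed

# Combined model: concatenate Z, combine T and R block-diagonally.
T = block_diag(T_trend, T_seasonal)
R = block_diag(R_trend, R_seasonal)
Z = Z_trend + Z_seasonal

# One noise-free transition of the combined state (level, slope, 3 seasonal terms).
state = [10.0, 1.0, 1.0, 2.0, 3.0]
next_state = [sum(T[i][j] * state[j] for j in range(5)) for i in range(5)]
```

The combined state has dimension 5 while the combined error has dimension 3, so the block-diagonal $R_t$ is a $5 \times 3$ matrix.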


2 Bayesian priors and analysis 


The parameters of a structural time series model are the residual variance
$H_t$ (often but not always taken to be a constant $\sigma^2$) and any
parameters of the state models. For many state models the parameters are simply
the variances of the error terms in various forms of random walks. These can be
modeled using inverse gamma or inverse Wishart priors.

Posterior inference for many state models would be trivial if the state
$\alpha_t$ were observed at every time point $t$. Thus structural time series
models are obvious candidates for a data augmentation algorithm that alternates
between drawing $\alpha = (\alpha_1, \ldots, \alpha_T)$ from $p(\alpha \mid y, \theta)$
and drawing $\theta$ from $p(\theta \mid \alpha, y)$. Here $y = (y_1, \ldots, y_T)$ and
$\theta$ denotes all model parameters. Several papers have been written
explaining how to directly simulate from $p(\alpha \mid \theta, y)$ using a technique
known as “forward filtering backward sampling.” Important early papers include
[Carter and Kohn(1994)], [Frühwirth-Schnatter(1994)], and
[de Jong and Shepard(1995)]. The bsts package uses the technique from
[Durbin and Koopman(2002)]. Assuming state models are combined using the
concatenation procedure described at the end of Section 1, the state model
parameters can be simulated conditionally independently given $\alpha$ and $y$.
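
The idea of forward filtering backward sampling can be illustrated on the simplest structural model, a local level model with observation equation $y_t = \mu_t + \epsilon_t$ and state equation $\mu_{t+1} = \mu_t + \eta_t$. The sketch below is an assumption-laden illustration and not the method used by the bsts package, which relies on the simulation smoother of [Durbin and Koopman(2002)]: the function name, the diffuse initial variance and the example data are all hypothetical, and both variances are taken as known and strictly positive.

```python
import math
import random


def ffbs_local_level(y, sigma2_obs, sigma2_level, m0=0.0, c0=1e6, rng=None):
    """Draw the state path (mu_1, ..., mu_n) given y for a local level model."""
    rng = rng or random.Random()
    n = len(y)
    means, variances = [], []
    a, p = m0, c0  # predictive mean and variance of the first state
    # Forward pass: Kalman filter.
    for obs in y:
        gain = p / (p + sigma2_obs)
        m = a + gain * (obs - a)     # filtered mean
        c = (1.0 - gain) * p         # filtered variance
        means.append(m)
        variances.append(c)
        a, p = m, c + sigma2_level   # prediction for the next state
    # Backward pass: sample the last state, then each earlier state in turn.
    draw = [0.0] * n
    draw[-1] = rng.gauss(means[-1], math.sqrt(variances[-1]))
    for t in range(n - 2, -1, -1):
        b = variances[t] / (variances[t] + sigma2_level)
        mean = means[t] + b * (draw[t + 1] - means[t])
        var = variances[t] * (1.0 - b)
        draw[t] = rng.gauss(mean, math.sqrt(var))
    return draw


# Hypothetical data; one posterior draw of the state path.
observations = [1.0, 2.0, 0.5, 3.0]
path = ffbs_local_level(observations, sigma2_obs=0.5, sigma2_level=1.0,
                        rng=random.Random(0))
```

In a full data augmentation sampler this draw of the state would alternate with a draw of the variances given the state, as described above.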

In many applications it can be particularly useful to include a regression
component as part of the model state. The bsts package offers two mechanisms for
handling regression on contemporaneous predictors (which can of course include
lags). The first is a dynamic regression, where

$y_t = Z_t^T \alpha_t + \beta_t^T x_t + \epsilon_t$. (5)

In this model, each coefficient in $\beta_t^T = (\beta_{t1}, \ldots, \beta_{tp})$ obeys a
random walk with $\beta_{t+1,j} = \beta_{tj} + \eta_{tj}$, where
$\eta_{tj} \sim N(0, \tau_j^2)$. An alternative is a static regression model

$y_t = Z_t^T \alpha_t + \beta^T x_t + \epsilon_t$. (6)

Both equations (5) and (6) can be rewritten in the form of equation (1), but it
is convenient here to separate the regression from other state components. The
equations differ based on whether $\beta$ is subscripted by $t$. The dynamic model
is more flexible, but the static model tends to perform better when there are
many predictors.

When facing large numbers of predictors, a sparse “spike-and-slab” prior can be
placed on the static regression coefficients. Let $\gamma$ denote a binary vector of the same
length as $\beta$, where $\gamma_j = 1$ indicates $\beta_j \neq 0$ and $\gamma_j = 0$ indicates $\beta_j = 0$. Then the
joint prior on $\beta$ and the constant residual variance parameter $H_t = \sigma^2$ can be written

$$p(\beta, \gamma, \sigma^2) = p(\gamma)\, p(\sigma^2 \mid \gamma)\, p(\beta_\gamma \mid \gamma, \sigma^2),$$ (7)

where $\beta_\gamma$ refers to the components of $\beta$ where $\gamma_j = 1$.

A convenient choice is to model $p(\gamma)$ as the product of independent Bernoulli
distributions $p(\gamma) = \prod_j \pi_j^{\gamma_j} (1 - \pi_j)^{1 - \gamma_j}$. The $\pi_j$ are prior quantities to be specified
by the analyst. The prior can be elicited by asking the analyst for an “expected model
size,” which is a guess at the number of nonzero coefficients. If the analyst expects
$k$ important coefficients, then set $\pi_j = k/p$, where $p$ is the dimension of $x_t$. Specific
values for individual $\pi_j$ can also be set, for example if there are certain variables the
analyst wishes to force in or out of the model.
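
The elicitation rule above is easy to make concrete. The following sketch assumes plain Python lists and hypothetical helper names (`inclusion_probs`, `log_prior_gamma`); it is not the bsts interface, only an illustration of setting each prior inclusion probability to k/p, forcing selected variables in or out, and evaluating the independent Bernoulli prior on the inclusion vector.

```python
import math

def inclusion_probs(p, expected_model_size, forced_in=(), forced_out=()):
    """Prior inclusion probabilities: k/p for every predictor, with optional overrides."""
    pi = [expected_model_size / p] * p
    for j in forced_in:
        pi[j] = 1.0          # variable always in the model
    for j in forced_out:
        pi[j] = 0.0          # variable never in the model
    return pi

def log_prior_gamma(gamma, pi):
    """Log of the independent Bernoulli prior for a 0/1 inclusion vector gamma."""
    total = 0.0
    for g, q in zip(gamma, pi):
        prob = q if g == 1 else 1.0 - q
        if prob == 0.0:
            return float("-inf")   # gamma contradicts a forced-in or forced-out variable
        total += math.log(prob)
    return total

# Example: 100 predictors, about 5 expected to matter, predictor 0 forced in.
pi = inclusion_probs(100, 5, forced_in=[0])
```

With several hundred predictors and a small expected model size, the prior puts most of its mass on sparse inclusion vectors, which is what makes it usable when predictors outnumber observations.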

A conjugate inverse gamma prior is a convenient choice for $p(\sigma^2 \mid \gamma)$. The bsts
package assumes this distribution is independent of $\gamma$. Finally, a Zellner prior is
assumed for $p(\beta_\gamma \mid \gamma, \sigma^2)$.

From equation (6), notice that $y_t - Z_t^T \alpha_t = \beta^T x_t + \varepsilon_t$ is just the equation of an
ordinary regression model. Thus conditional on the draw of $\alpha$, we can integrate $\beta_\gamma$
and $\sigma^2$ out of the model and simulate from $p(\gamma \mid y, \alpha, \theta)$ using a sequence of well
understood Gibbs sampling steps (e.g. [George and McCulloch(1997)]). Then, conditional
on the draw of $\gamma$, we can directly simulate from $p(\beta_\gamma, \sigma^2 \mid \alpha, \gamma, \theta)$ through conjugacy.


3 Applications 
3.1 Nowcasting economic time series 


Structural time series models were used by [Scott and Varian(2014)] and
[Scott and Varian(2015)] to make short term forecasts (or “nowcasts”) of economic time
series using readily observed predictors that are released more rapidly than official
government sanctioned numbers. The predictors in both cases were data from Google
Trends. Several hundred such predictors were available for any weekly or monthly
time series, so the spike-and-slab prior discussed above was helpful.


3.2 Causal Modeling


Structural time series models were used by [Brodersen et al(2015)]
as a method of measuring the impact of a market intervention (e.g. an advertising 
campaign). The first step is to fit the models to data observed prior to the market 
intervention. The models are then used to forecast the counterfactual time series of 
metrics (e.g. sales) that would have occurred had the intervention not taken place. 
The difference between the observed series and the forecasted counterfactual is the 
estimated impact. A regression component is a critical feature of this model, be- 
cause contemporaneous predictors unaffected by the intervention can be used to 
improve the counterfactual forecast. Examples of such predictors might include a 
competing firm’s sales, or counts of Google searches for relevant terms. The predic- 
tors are available during the forecast period because the “forecast” is done after the 
conclusion of the market intervention, after which time the relevant predictors are 
observed. 


3.3 Long term forecasting 


Structural time series models can be used for longer term forecasting, although an- 
alysts should think carefully about trend models based on random walks (like the 
local linear trend). The variance of a random walk increases linearly with the num- 
ber of time steps into the future, which can lead to unrealistically large forecast 
errors. Replacing the slope in a local linear trend model with a stationary AR(1) 
process (centered on a long term global trend D) provides additional stability. 


$$y_t = \mu_t + \varepsilon_t$$
$$\mu_{t+1} = \mu_t + \delta_t + \eta_{0t}$$ (8)
$$\delta_{t+1} = D + \rho(\delta_t - D) + \eta_{1t}$$
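
To see why the stationary slope stabilizes long horizons, one can propagate the covariance of the level and slope forward under equation (8). The sketch below is an illustration under stated assumptions: the current level and slope are treated as known, parameter uncertainty and observation noise are ignored, and the function name and variance values are made up. Setting rho = 1 recovers the local linear trend.

```python
def forecast_level_variance(horizon, level_var, slope_var, rho):
    """Variance of the level h steps ahead, for h = 1..horizon.

    State is (level, slope) with transition [[1, 1], [0, rho]] and independent
    innovations of variance level_var and slope_var.
    """
    p_mm = p_md = p_dd = 0.0     # covariance of (level, slope), known at the origin
    out = []
    for _ in range(horizon):
        # P_next = T P T' + diag(level_var, slope_var); the right-hand side uses old values.
        p_mm, p_md, p_dd = (
            p_mm + 2.0 * p_md + p_dd + level_var,
            rho * (p_md + p_dd),
            rho * rho * p_dd + slope_var,
        )
        out.append(p_mm)
    return out

local_linear = forecast_level_variance(180, level_var=1.0, slope_var=1.0, rho=1.0)
semi_local = forecast_level_variance(180, level_var=1.0, slope_var=1.0, rho=0.5)
```

With rho = 1 the slope is itself a random walk, so the level variance grows much faster than linearly in the horizon; with |rho| < 1 the slope variance stays bounded and the level variance grows roughly linearly.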


To illustrate the impact of this assumption, consider a forecast of the future 
Google stock price based on the historical data shown in Figure 1. There is clear 
daily volatility in this data, but there is also a clear “up and to the right” trend. 

The top panel of Figure 2 shows a forecast 180 days into the future based on 
the local linear trend model. Notice that the scale of the plot is sufficiently wide 
to make the variation in the stock price over the last 3 years appear constant. The 
local linear trend model is extremely flexible. This flexibility is useful for short term 
forecasts, but for longer term forecasts the local linear trend allows for the possibility 
of hyper-explosive growth or catastrophic failure far beyond anything observed in 
the preceding 10-year period. 

Contrast the bottom panel of Figure 2, which shows a forecast distribution that
matches the empirical volatility of the historical data much more closely. At the end
of 2016 the Google stock price was around 790.

Fig. 2 Forecast distributions of Google stock price based on the local linear trend model (top
panel) and the semi-local linear trend model from equation (8) (bottom panel).


3.4 Handling non-Gaussian errors 


When non-Gaussian error distributions are required, structural time series models
can often still be used by introducing a set of latent variables that render the
model conditionally Gaussian. Well known methods exist for probit regression
[Albert and Chib(1993)] and models with Student t errors [Gelman et al(2014)].
Somewhat more complex methods exist for logistic regression
[Frühwirth-Schnatter and Frühwirth(2005)], [Holmes and Held(2006)],
[Gramacy and Polson(2012)] and Poisson regression [Frühwirth-Schnatter et al(2008)].
Additional methods exist for quantile regression, support vector machines, and
multinomial logit regression, though they are not yet provided by the bsts package.

To see how non-Gaussian errors can be useful, consider the analysis done by
[Berge et al(2016)], who used Bayesian model averaging (BMA) to investigate which
of several economic indicators would best predict the presence or absence of a
recession. The model they used was a probit regression, which took the presence or
absence of a recession (as determined by the NBER recession determinations:
http://www.nber.org/cycles.html) as a response variable. The model was highly
predictive, but it ignored serial dependence in the data. The BMA done by
[Berge et al(2016)] is essentially the same as fitting a logistic regression under a
spike-and-slab prior like the one described in Section 2 with all $\pi_j = 1/2$. Running
that analysis using the BoomSpikeSlab R package [Scott(2010)] (similar to bsts, but
without the time series) largely replicates their results (up to minor Monte Carlo error).

To capture serial dependence, consider the following dynamic logistic regression
model with a local level trend.

$$\mathrm{logit}(p_t) = \mu_t + \beta^T x_t$$
$$\mu_{t+1} = \mu_t + \eta_t$$ (9)

Fig. 3 Distribution of state (on logit scale) for recession data. Blue dots show the true presence or
absence of a recession, as determined by official statistics.


Here $p_t$ is the probability of a recession at time $t$, and $x_t$ is the set of economic
indicators used by [Berge et al(2016)] in their analysis. The distribution of $\mu_t$ is
plotted in Figure 3, which shows it going to very large values during a recession, and
to very small values outside of a recession. This reflects the fact that recessions are
rare, but once they occur they tend to persist. Assuming independent time points is
therefore unrealistic, and substantially overstates the amount of information available
to identify logistic regression coefficients. The posterior distribution of the
coefficients in model (9) gives fewer nonzero coefficients than the corresponding
analysis assuming independent observations.


Asymptotically Efficient Estimation in
Measurement Error Models




Catia Scricciolo 


Abstract Inference on linear functionals of the latent distribution in measurement
error models is considered. The issue of asymptotically efficient estimation by
maximum likelihood in a convolution model with Laplace error distribution is settled
in the affirmative: maximum likelihood estimators of certain linear functionals
of the mixing distribution are $\sqrt{n}$-consistent, asymptotically normal and efficient.
Asymptotic normality of a Studentized version of the maximum likelihood estimator
allows one to construct confidence intervals for linear functionals. Regarding maximum
likelihood estimation of the mixing distribution as a data-driven choice of the
a priori distribution on the mixing parameter in an empirical Bayes approach to the
problem of estimating the single means, a sequence of estimators can be constructed
such that it is asymptotically optimal in a decision-theoretic sense.




Key words: asymptotic efficiency, asymptotic normality, empirical Bayes, Laplace
mixture model, maximum likelihood estimate, mixing distribution.


1 Introduction and Main Results


The problem of asymptotically efficient estimation by the maximum likelihood
method of linear functionals in measurement error models is considered. The issue
of asymptotic efficiency of the maximum likelihood estimator (MLE) for
certain linear functionals in a convolution model with a Laplace error distribution
is settled in the affirmative. Under regularity conditions, the MLE is $\sqrt{n}$-consistent,
asymptotically normal and efficient, even though typically the unknown latent
distribution can only be estimated at slower rates. Given a consistent estimator
of the efficient asymptotic variance, a Studentized version of the re-centered MLE
also converges to a standard normal distribution. This allows one to construct asymptotic
confidence intervals for linear functionals and to make inference about them.


Model Description 


Let $X$ be a real-valued random variable (r.v.) with unknown distribution $P_0$. Assume
that $P_0$ possesses density $p_0$ with respect to Lebesgue measure $\lambda$ on $\mathbb{R}$, that is,
$p_0 := \mathrm{d}P_0/\mathrm{d}\lambda$. Assume that

$$X = Y + Z,$$ (1)

with $Y$ and $Z$ independent, unobservable random variables. The distribution $G_0$ of $Y$
is unknown, while $Z$ has Laplace(0, 1) distribution with probability density function

$$f_Z(z) = \tfrac{1}{2} e^{-|z|}, \quad z \in \mathbb{R}.$$

Then, $p_0$ is the convolution of $f_Z$ and $G_0$ or, in other terms, a location mixture of
Laplace densities with mixing distribution $G_0$ supported on some set $\mathcal{Y} \subseteq \mathbb{R}$,

$$p_0(x) = \int_{\mathcal{Y}} f_Z(x - y)\, \mathrm{d}G_0(y) = \frac{1}{2} \int_{\mathcal{Y}} e^{-|x - y|}\, \mathrm{d}G_0(y), \quad x \in \mathbb{R}.$$


In what follows, we write $p_{G_0}$ in place of $p_0$ when we want to stress the dependence
of $p_0$ on $G_0$. We observe $n$ independent and identically distributed (i.i.d.) copies
$X_1, \ldots, X_n$ of $X$ satisfying relationship (1),

$$X_i = Y_i + Z_i, \quad i = 1, \ldots, n.$$

We observe the noisy data $X_1, \ldots, X_n$ instead of the uncorrupted r.v.'s $Y_1, \ldots, Y_n$. The
i.i.d. r.v.'s $Z_1, \ldots, Z_n$ represent additive errors and their known (Laplace) distribution
is called the error distribution. In this classical additive measurement error model
we have $E(X \mid Y) = Y$ and $X$ varies around $Y$. Interest in this model is motivated
by the fact that measurement errors occur in nearly every discipline from medical
statistics to astronomy and econometrics.


Asymptotic Efficiency of Linear Functionals of the MLE for the Mixing Distribution 


We are interested in estimating linear functionals of $G_0$ of the form

$$\theta_0 = \theta_{G_0} = \int_{\mathcal{Y}} a(y)\, \mathrm{d}G_0(y)$$ (2)

with $a: \mathcal{Y} \to \mathbb{R}$ a given function. Linear functionals of $G_0$ describe different features
of the unknown mixing distribution. For example, if, for any fixed $y_0 \in \mathcal{Y} \subseteq \mathbb{R}$, the
function $a(y) = 1_{(-\infty, y_0]}(y)$, then the linear functional $\theta_0 = P(Y \le y_0) = G_0(y_0)$ is
the cumulative distribution function of $Y$ evaluated at the point $y_0$; if $a(y) = y$, then
the linear functional $\theta_0 = EY$ is the mean of $G_0$ which, for a zero-mean error density,
coincides with the mean $EX$ of the observations,

$$EX = \int_{\mathbb{R}} x\, p_0(x)\, \mathrm{d}x = \int_{\mathcal{Y}} \int_{\mathbb{R}} x f_Z(x - y)\, \mathrm{d}x\, \mathrm{d}G_0(y) = \int_{\mathcal{Y}} y\, \mathrm{d}G_0(y) = EY,$$

with $a(y) = y = \int_{\mathbb{R}} x f_Z(x - y)\, \mathrm{d}x$. If, for a fixed real number $s$ in a neighborhood
of zero, the function $a(y) = e^{sy}$, then the linear functional $\theta_0 = \int_{\mathcal{Y}} e^{sy}\, \mathrm{d}G_0(y)$ is
the moment generating function (m.g.f.) of $G_0$ at the point $s$, denoted by $M_Y(s)$.
Relevant aspects of the mixing distribution $G_0$, such as the mean and the variance,
can be expressed as functionals of the m.g.f. $M_Y(\cdot)$. Hence, virtually all results in
statistical estimation of characteristics of $G_0$ can be obtained as by-products of the
inference on the m.g.f. $M_Y(\cdot)$. In some cases, simple naive estimators of $\theta_0$ are
available. For example, an estimator of the mean of $Y$ is the sample mean of the
observations $\bar{X}_n = \sum_{i=1}^n X_i/n$; an estimator of the m.g.f. $M_Y(s)$ is $(1 - s^2) \sum_{i=1}^n e^{sX_i}/n$
for $|s| < 1$, namely, the ratio between the empirical m.g.f. of the observations and
the m.g.f. of the error r.v. $Z$, which is equal to $(1 - s^2)^{-1}$ for $|s| < 1$.
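
The two naive estimators just described can be checked on simulated data. The sketch below is only an illustration: the two-point mixing distribution on {0, 2}, the sample size, the seed, and the function names are assumptions made for the example, not part of the paper.

```python
import math
import random

def sample_laplace(rng):
    """Laplace(0, 1) draw, generated as the difference of two independent Exp(1) draws."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)

def naive_mean(xs):
    """Sample mean of the noisy observations; estimates EY because the error has mean zero."""
    return sum(xs) / len(xs)

def naive_mgf(xs, s):
    """Empirical m.g.f. of X divided by the Laplace m.g.f. (1 - s^2)^(-1), valid for |s| < 1."""
    if abs(s) >= 1.0:
        raise ValueError("the Laplace(0, 1) m.g.f. is finite only for |s| < 1")
    return (1.0 - s * s) * sum(math.exp(s * x) for x in xs) / len(xs)

# Assumed example: Y is 0 or 2 with equal probability, so EY = 1 and
# M_Y(s) = (1 + exp(2 s)) / 2.
rng = random.Random(42)
xs = [rng.choice([0.0, 2.0]) + sample_laplace(rng) for _ in range(20000)]
mean_hat = naive_mean(xs)
mgf_hat = naive_mgf(xs, 0.3)
mgf_true = (1.0 + math.exp(0.6)) / 2.0
```

Both estimators are empirical means of transformed observations, so they need no knowledge of the mixing distribution beyond the known error law.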

A principled method for estimating linear functionals of $G_0$ as in (2) is that of
estimating $G_0$ by the MLE $\hat{G}_n$ and then plugging $\hat{G}_n$ into the expression of $\theta_{G_0}$
to obtain the MLE $\hat{\theta}_n = \theta_{\hat{G}_n}$. The (nonparametric) MLE $\hat{G}_n$ of $G_0$ is a measurable
function of the observations $X_1, \ldots, X_n$ taking values in the collection $\mathcal{G}$ of all
probability measures on $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$, with $\mathcal{B}(\mathcal{Y})$ the Borel $\sigma$-field on $\mathcal{Y}$, such that

$$\hat{G}_n \in \arg\max_{G \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \log p_G(X_i) = \arg\max_{G \in \mathcal{G}} \int (\log p_G)\, \mathrm{d}\mathbb{P}_n,$$

where $p_G(\cdot) := \int_{\mathcal{Y}} f_Z(\cdot - y)\, \mathrm{d}G(y)$ is the location mixture of Laplace densities with
mixing distribution $G \in \mathcal{G}$ and $\mathbb{P}_n := n^{-1} \sum_{i=1}^n \delta_{X_i}$ is the empirical measure associated
with the random sample $X_1, \ldots, X_n$, namely, the discrete uniform distribution
on the sample values that puts mass $1/n$ on each one of the observations. We assume
that the MLE exists, but do not require it to be unique. Lindsay [3] showed that the
MLE $\hat{G}_n$ is a discrete distribution supported on at most $k \le n$ support points, $k$ being
the number of distinct observed values or data points. Any linear functional as in (2)
can then be estimated by the (plug-in) MLE

$$\hat{\theta}_n = \theta_{\hat{G}_n} = \int_{\mathcal{Y}} a(y)\, \mathrm{d}\hat{G}_n(y).$$
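
The plug-in idea can be illustrated numerically. The sketch below does not compute the exact nonparametric MLE: as a simplifying assumption it restricts the mixing distribution to a fixed grid of support points and maximizes the likelihood over the weights with the EM algorithm. The grid, the data, and the function names are all made up for the example.

```python
import math

def laplace_pdf(z):
    """Density of the Laplace(0, 1) distribution."""
    return 0.5 * math.exp(-abs(z))

def grid_npmle(xs, grid, n_iter=200):
    """EM iterations for the mixing weights on a fixed grid (an approximation to the MLE)."""
    m = len(grid)
    w = [1.0 / m] * m
    dens = [[laplace_pdf(x - y) for y in grid] for x in xs]
    for _ in range(n_iter):
        new_w = [0.0] * m
        for row in dens:
            mix = sum(wj * dj for wj, dj in zip(w, row))   # mixture density at this x
            for j in range(m):
                new_w[j] += w[j] * row[j] / mix            # posterior weight of grid point j
        w = [v / len(xs) for v in new_w]
    return w

def plug_in(a, grid, w):
    """Plug-in estimate of the linear functional: integral of a(y) against the fitted weights."""
    return sum(wj * a(y) for y, wj in zip(grid, w))

xs = [-1.2, 0.1, 0.4, 2.3, 1.9]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
w = grid_npmle(xs, grid)
mean_hat = plug_in(lambda y: y, grid, w)              # a(y) = y
cdf_hat = plug_in(lambda y: float(y <= 0.5), grid, w) # a(y) = indicator of y <= 0.5
```

Because the fitted distribution is discrete, every linear functional reduces to a finite weighted sum, which is also true of the exact MLE by Lindsay's result.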


We study the behaviour of $\hat{\theta}_n$ to answer the question of whether there are functionals
of $G_0$ that can be consistently estimated using the maximum likelihood method at
$\sqrt{n}$-rate and for which the estimator $\sqrt{n}(\hat{\theta}_n - \theta_0)$ is asymptotically normal and
efficient, in the sense that it has asymptotically normal distribution with mean zero
and minimum variance. In fact, in the theory of asymptotic efficiency of the MLE,
see Chapter 11 in van de Geer [6], there may be linear functionals of $G_0$ which
can be estimated at $\sqrt{n}$-rate, even if $G_0$ itself can only be estimated (at best) at a
slower rate, which, in the present case, is $O_P(n^{-1/5})$ relative to the $L^1$-Wasserstein
distance, see Dedecker et al. [2]. A related issue is the existence of estimators that
are empirical means of certain transformations of the observations, like the ones
previously considered, which may not be equal to the MLE, but are asymptotically
efficient.

To establish asymptotic normality and efficiency of linear functionals of the MLE
$\hat{G}_n$, we can appeal to Theorem 11.8 of van de Geer [6], pp. 217-220. We preliminarily
introduce some more notation. For some $\sigma_n \downarrow 0$, let $\tau_1^2(\sigma_n) := \int_{p_0 \le \sigma_n} p_0\, \mathrm{d}\lambda$
and $\tau_2^2(\sigma_n) := \int_{p_0 > \sigma_n} (1/p_0)\, \mathrm{d}\lambda$. Here below we state the assumptions we shall be
using:


(A1) $|\hat{\theta}_n - \theta_0| = o_P(1)$,

(A2) for $\sigma_n = n^{-3/8} \log^{1/8} n$, we have $\tau_1^2(\sigma_n) = O(\sigma_n^2)$ and $\tau_2^2(\sigma_n) = O(|\log \sigma_n|)$,

(A3) either $a$ is bounded or $G_0$ is compactly supported,

(A4) $\dot{a}(y) := \mathrm{d}a(y)/\mathrm{d}y$ exists and $\|\dot{a}\|_\infty < +\infty$,

(A5) there exist constants $0 < c_1, c_2 < +\infty$ such that, for every $y \in \mathrm{support}(G_0) \subseteq \mathbb{R}$,
| d(ea(y)) 


<cı and 
dGo(y) È | 


(A6) $M_Y(1) := \int_{\mathcal{Y}} e^{y}\, \mathrm{d}G_0(y) < +\infty$ and $M_Y(-1) := \int_{\mathcal{Y}} e^{-y}\, \mathrm{d}G_0(y) < +\infty$.


We now make some remarks on Assumptions (A1)-(A6). Assumption (A1) is
satisfied if $a$ is continuous, which is guaranteed by (A4), and either $a$ is bounded or
$G_0$ is compactly supported, provided that $\hat{G}_n$ converges weakly to $G_0$ almost surely,
in other terms, $\hat{G}_n$ is strongly consistent at $G_0$. Sufficient conditions for strong
consistency of the MLE $\hat{G}_n$ are stated in Theorem 2.3 of Chen [1]. Assumption (A2)
proposes the same conditions employed by Scricciolo [5] in Proposition 4 to establish
the rate of convergence in the Hellinger metric for the MLE of a Laplace
mixture. The result is obtained adopting a convenient approach according to which
it is the dimension of the class of kernels and the behaviour of the sampling density
$p_0$ near zero that jointly determine the rate of convergence for the MLE. Assumption
(A4) is standard and is used to establish boundedness of subdirections and influence
curves. The remaining Assumptions (A5)-(A6) are technical and are used for the
same purpose. We are now in a position to state the main result.


Proposition 1. Under Assumptions (A1)-(A6),

$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \tau_0^2),$$

where $\tau_0^2$ is the efficient asymptotic variance.

If $\hat{\tau}_n^2$ is a consistent estimator of the efficient asymptotic variance $\tau_0^2$, then the
Studentized version of the recentered MLE is asymptotically distributed according
to a standard normal, $\sqrt{n}\, \hat{\tau}_n^{-1} (\hat{\theta}_n - \theta_0) \xrightarrow{\mathcal{L}} \mathcal{N}(0, 1)$, and asymptotic confidence
intervals for $\theta_0$ can be constructed and hypothesis tests carried out.
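
As a worked consequence of the Studentized limit, the confidence interval can be written out explicitly. This is a sketch under assumed notation: $\hat{\theta}_n$ is the plug-in MLE of the functional, $\theta_0$ its true value, $\hat{\tau}_n^2$ a consistent estimator of the efficient asymptotic variance, $z_{1-\alpha/2}$ the standard normal quantile, and $\alpha \in (0, 1)$ a nominal error level chosen by the analyst.

```latex
% Asymptotic (1 - \alpha) confidence interval implied by
% \sqrt{n}\,\hat{\tau}_n^{-1}(\hat{\theta}_n - \theta_0) \to N(0, 1) in law.
\[
  \Pr\!\left( \hat{\theta}_n - z_{1-\alpha/2}\,\frac{\hat{\tau}_n}{\sqrt{n}}
      \;\le\; \theta_0 \;\le\;
      \hat{\theta}_n + z_{1-\alpha/2}\,\frac{\hat{\tau}_n}{\sqrt{n}} \right)
  \;\longrightarrow\; 1 - \alpha
  \qquad \text{as } n \to \infty .
\]
% A two-sided test of H_0 : \theta_0 = \theta^{*} at asymptotic level \alpha rejects when
\[
  \sqrt{n}\,\frac{|\hat{\theta}_n - \theta^{*}|}{\hat{\tau}_n} \;>\; z_{1-\alpha/2} .
\]
```

The interval has width of order $n^{-1/2}$, the same as in a regular parametric problem, even though the mixing distribution itself is estimated at a slower rate.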


Asymptotic Optimality of the MLE in the Empirical Bayes Approach to the 
Decision Problem of Estimating the Single Means 


Maximum likelihood is a principled method for estimating the mixing distribution
and leads to asymptotically efficient estimation of certain linear functionals. In a
decision-theoretic framework, if the problem of estimating the mixing distribution
in a mixture model is regarded as that of selecting the a priori distribution on the
mixing parameter by a data-driven choice in an empirical Bayes approach to statistical
decision problems when, to say it with Robbins [4], “the same decision problem
presents itself repeatedly and independently with a fixed but unknown a priori
distribution of the parameter”, then the MLE also has an optimality property. A
description of the elements of the Bayesian decision problem is as follows:
• a parameter space $\Theta$ with generic state of nature $\theta$,
• an action space $A$ with generic action $a$,
• a prior distribution $G$ on $\Theta$,
• a r.v. $X$ taking values in $\mathcal{X} \subseteq \mathbb{R}$ such that $X \mid (\Theta = \theta) \sim f_\theta(\cdot) := f_Z(\cdot - \theta)$.
The statistical decision problem consists in choosing a decision function $t: \mathcal{X} \to A$,
which has conditional expected loss $R(t, \theta) := \int_{\mathcal{X}} L(t(x), \theta) f_\theta(x)\, \mathrm{d}x$ when $\theta$ is the
parameter. By averaging over $\theta$ when the a priori distribution is $G$, we get the
overall expected loss or Bayes risk $R(t, G) = \int_\Theta R(t, \theta)\, \mathrm{d}G(\theta)$.

Consider $n$ independent repetitions of the same decision problem that give rise to

$$(\theta_1, x_1), (\theta_2, x_2), \ldots, (\theta_n, x_n),$$

with $x^{(n)} = (x_1, \ldots, x_n)$ observed and $\theta_1, \ldots, \theta_n$ unknown. When a decision has
to be made about $\theta_{n+1}$, the observations $(x_1, \ldots, x_{n+1})$ are available, whereas the
values $\theta_1, \ldots, \theta_{n+1}$ remain unknown. One can use a function $t$ of $x^{(n)}$ evaluated at
the point $x_{n+1}$, that is, $t_n(x_{n+1}) = t(x_{n+1}; x^{(n)})$. An empirical decision procedure is
then a sequence $T = \{t_n\}$ with expected loss

$$R_n(T, G) := \int_{\mathcal{X}} \int_{\mathcal{X}^n} \int_\Theta L(t_n(x), \theta) f_\theta(x)\, \mathrm{d}G(\theta)\, \mathrm{d}P_G^{n}(x^{(n)})\, \mathrm{d}x.$$


The sequence $T$ is said to be asymptotically optimal (a.o.) relative to every $G$ in a
class $\mathcal{G}$ if, for every $G \in \mathcal{G}$, the expected loss $R_n(T, G)$ is asymptotically equal to
the minimum Bayes risk relative to $G$. In symbols,

$$\text{for every } G \in \mathcal{G}, \quad \lim_{n \to \infty} R_n(T, G) = \inf_t R(t, G).$$


It is known from Robbins [4] that an a.o. sequence $T$ relative to some class $\mathcal{G}$ exists
if we can find a sequence $G_n$ of distribution functions that converges in distribution
to $G$, whatever $G \in \mathcal{G}$ is. The MLE $\hat{G}_n$ is one such sequence if it is strongly
consistent. Then, a sequence $T$ can be constructed following the indications of Corollary
1 in Robbins [4]. Consider a finite action space $A = \{a_0, a_1, \ldots, a_m\}$ and any $G \in \mathcal{G}$
such that $\int_\Theta L(a_j, \theta)\, \mathrm{d}G(\theta) < +\infty$ for $j = 0, \ldots, m$. If, for every $j = 0, \ldots, m$ and
$\lambda$-almost every $x$, $[L(a_j, \theta) - L(a_0, \theta)] f_\theta(x)$ is continuous and bounded as a function
of $\theta$, then, defining $\Delta_{j,n}(x) := \int_\Theta [L(a_j, \theta) - L(a_0, \theta)] f_\theta(x)\, \mathrm{d}\hat{G}_n(\theta)$ and

$$t_n(x) := a_k \quad \text{when} \quad \Delta_{k,n}(x) = \min_{0 \le j \le m} \Delta_{j,n}(x),$$

the sequence $T = \{t_n\}$ is a.o. relative to $\mathcal{G}$.
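
The decision rule above is simple to evaluate once the estimated prior is discrete. The sketch below is an illustration under stated assumptions: the estimated prior is given as hypothetical lists of support points and weights, the loss is squared error, the action space has two elements, and the function names are made up.

```python
import math

def laplace_pdf(z):
    """Density of the Laplace(0, 1) distribution, so f_theta(x) = laplace_pdf(x - theta)."""
    return 0.5 * math.exp(-abs(z))

def empirical_bayes_action(x, actions, loss, support, weights):
    """Return the action a_k minimizing Delta_{j,n}(x) over the finite action space.

    Delta_{j,n}(x) is the integral of [L(a_j, theta) - L(a_0, theta)] f_theta(x)
    against the estimated (discrete) prior, here a finite weighted sum.
    """
    a0 = actions[0]

    def delta(a):
        return sum(
            w * (loss(a, th) - loss(a0, th)) * laplace_pdf(x - th)
            for th, w in zip(support, weights)
        )

    return min(actions, key=delta)

def squared_loss(a, theta):
    return (a - theta) ** 2

# Assumed estimated prior: mass 1/2 at 0 and 1/2 at 4; actions are the same two points.
support, weights = [0.0, 4.0], [0.5, 0.5]
actions = [0.0, 4.0]
near_zero = empirical_bayes_action(0.2, actions, squared_loss, support, weights)
near_four = empirical_bayes_action(3.9, actions, squared_loss, support, weights)
```

Since the difference for the reference action a_0 is zero by construction, another action is chosen only when its estimated posterior expected loss is strictly smaller.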


2 Final Remarks 


We have considered the problem of asymptotically efficient estimation by the maximum
likelihood method of linear functionals of the unknown mixing distribution
in a standard additive Laplace measurement error model: the MLE of certain linear
functionals is $\sqrt{n}$-consistent, asymptotically normal and efficient.

Maximum likelihood estimation of the mixing distribution can also be regarded
as the selection of the a priori distribution on the mixing parameter by a data-driven
choice in an empirical Bayes approach to the problem of estimating the single
means. When the MLE $\hat{G}_n$ is strongly consistent, a sequence of estimators $T = \{t_n\}$
for the single means can be constructed based on $\hat{G}_n$ such that it has expected loss
asymptotically equal to the minimum Bayes risk. Since, however, $\hat{G}_n$ is not explicitly
known, it would be interesting to investigate when a sequence $T = \{t_n\}$ can be
constructed based on simple naive efficient estimators $\hat{\theta}_n$ of $\theta_0$.


On the noisy high-dimensional gene expression
data analysis


Angela Serra, Pietro Coretto and Roberto Tagliaferri 


Abstract The main goal of microarray experiments is to identify, within thousands 
of genes, groups that show similar co-expression patterns. In most cases the analysis 
starts from the estimation of a sample correlation matrix used to construct the input 
dissimilarity. However, the sample correlation matrix is highly distorted by the pres- 
ence of outlying experimental units, and the typical large ratio between the number 
of genes and the number of patients. We review the joint action of these issues, and 
we discuss some possible remedies. We consider real data from some well known 
microarray experiments, and we perform cluster analysis based on both the usual 
sample correlation, and some “cleaned” alternatives. Finally, we investigate the
differences between the obtained groups and we draw some conclusions.


Key words: Outliers, high-dimensional data, gene expression data, DNA microar- 
rays. 


1 Introduction 


A major role of DNA microarrays is to find genes that behave similarly across
various experimental conditions. This biological co-expression concept translates
into the technical notion of statistical similarity. Co-expression can be measured in
several different ways; however, correlation-based similarity plays a crucial role
in gene expression data analysis. The main practice is to perform clustering taking
a correlation-based dissimilarity matrix as input. The groups found by the subsequent
cluster analysis are highly dependent on the chosen similarity measure. In the next
Section we discuss two major problems with these classical correlation measures:
lack of robustness, and excessive variation due to the typical high-dimensional
setting.

This paper reviews the impact of these issues in the context of DNA microarrays,
together with possible alternatives already proposed in the literature (see Section 2).
As a major contribution, we investigate what happens to the routinely
applied cluster analysis. We show that the obtained partitions differ substantially
from those found based on classical correlation-based dissimilarities (see Section 3).
Final remarks and an overview of future research projects are discussed in Section
4.


2 Issues and cures 


In this section we will review some of the major issues in estimating correlation 
matrices in large scale genomic studies. Cures proposed in the literature are also 
reviewed. 


Excessive noise. “Getting the noise out of gene arrays” is the evocative title of
the popular paper by Marshall (2004), where it was explained that microarray data
sampling is terribly noisy, and this undermines the possibility of reaching scientific
consensus on the empirical evidence. Biologists attribute the “excessive noise” to strong
differences in experimental conditions (see Yang et al., 2002) and data acquisition
platforms (see Wang et al., 2005). This “excessive noise” is essentially what is called
“data contamination” in classical robust statistics. However, excessive noise is not always
strictly related to data contamination. Often non-regular data points, also known as
outliers, are representative of a minority subpopulation, as pointed out in Coretto and
Hennig (2016). This happens particularly in genomic studies, where one cannot expect
completely homogeneous populations. Whatever the cause is, it is well known
that a few outlying data points can completely break down the sample correlation
matrix. Even though this is a well-known problem within the statistical community,
few efforts have been made to address it in genetic studies. Hardin et al. (2007)
proposed a robust metric based on Tukey's biweight statistics, while Bickel (2003)
proposed classical rank-based dissimilarity measures. Both contributions aim to solve
the problem of pairwise correlation estimation. However, it is well known (see
Maronna et al., 2006) that pairwise estimation does not necessarily lead to a well
behaved correlation matrix estimator. The additional problem here is that DNA
microarrays involve thousands of genes, and generally high break-down
covariance/correlation matrix estimators are not well defined in the high-dimensional
setting. Serra et al. (2017) developed the Rmap, a robust correlation matrix estimator
based on the robust pairwise correlation estimator developed in Pasman and
Shevlyakov (1987). The Rmap is simple to compute and does not require tuning.
Although Rmap performs better than classical estimators, Serra et al. (2017) showed
that in the high-dimensional setting the advantages of a robust procedure are
counterbalanced by the additional issue described below.


Effects of large concentration ratios. Estimation of covariance matrices under the
high-dimensional regime has been central in recent years. The high-dimensional
regime is the situation where the number n of sampling units is smaller than the
number p of genes. A large concentration ratio, that is p/n, drives the bias of the
sample correlation matrix to unacceptable levels. When n << p the spectral
components of the sample covariance/correlation matrix are dramatically distorted so
that most classical dimension reduction techniques (e.g. PCA) would fail to
recover appropriate and informative data subspaces. Large p/n also causes the
emergence of many spurious correlations, and this has a strong impact on gene network
reconstruction. Although in genomic studies there are several methods to filter out
the many small correlations, it seems to be overlooked that the problem is
mainly due to the additional noise introduced by the large p/n. Concentration
ratios larger than p/n = 25 are rather common in gene expression data sets. Since
thousands of genes are sampled, it is generally thought that only a relatively small
proportion of pairs are actually co-expressed. This translates into a sparsity
assumption. Under sparsity assumptions two classes of statistical methods have recently
emerged: (i) penalized methods, (ii) thresholding methods. One way to “clean”
the effects of a large p/n is to estimate the precision matrix based on penalized
likelihood-type estimators (Yuan and Lin, 2007; d'Aspremont et al., 2008; Friedman
et al., 2008; Rothman et al., 2008; Cai et al., 2011). However, there are drawbacks:
(i) these methods do not lead to scalable computations, (ii) they only provide sparse
estimates of the inverse covariance matrix, while we are interested in the correlation
matrix. Thresholding simply cuts off relatively small covariances/correlations
(see Bickel and Levina, 2008; El Karoui, 2008; Rothman et al., 2009; Cai and Liu,
2011). Thresholding estimators are simple to compute and easy to interpret. Bickel
and Levina (2008) also proposed random cross-validation to optimally choose the
threshold parameter. In the gene expression context the problem has been addressed
by Serra et al. (2017), where it is proposed to regularize the Rmap based on
cross-validated hard thresholding. The resulting estimator, the Robust Sparse Correlation
matrix (RSC), showed remarkable performance on both synthetic and real data.
Perhaps the key finding of Serra et al. (2017) is that only jointly treating robustness and
high-dimensionality produces the desired improvement.
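
Hard thresholding itself takes only a few lines. The sketch below is a generic illustration and not the RSC estimator: it applies a threshold chosen by hand to an arbitrary correlation matrix stored as nested lists, whereas the text describes a cross-validated threshold applied to the Rmap. The function name and the example matrix are assumptions.

```python
def hard_threshold(corr, t):
    """Set off-diagonal entries with absolute value below t to zero; keep the diagonal."""
    p = len(corr)
    return [
        [corr[i][j] if i == j or abs(corr[i][j]) >= t else 0.0 for j in range(p)]
        for i in range(p)
    ]

corr = [
    [1.0, 0.05, -0.6],
    [0.05, 1.0, 0.2],
    [-0.6, 0.2, 1.0],
]
sparse_corr = hard_threshold(corr, 0.3)   # only the -0.6 entries survive off the diagonal
```

The operation preserves symmetry and the unit diagonal, but it does not by itself guarantee a positive semidefinite result.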


3 Evidence from DNA microarray 


Experiments have been performed on two real gene expression data sets related
to Breast Cancer. The TCGA.BRCA data set was downloaded from the Cancer
Genome Atlas (TCGA) (https://tcga-data.nci.nih.gov/tcga/). It
consists of 151 patients and 4100 genes. The OXF.BRCA data set, described in
Buffa et al. (2011), was downloaded from NCBI GEO
(http://www.ncbi.nlm.nih.gov/geo/) under the accession numbers GSE22219 and GSE22220.
The two data sets were already preprocessed. As a further step, genes with low
variance were eliminated and batch effect removal was performed with the ComBat
method available in the sva R package of Leek et al. (2011). For each data set
correlation matrices have been estimated by using the RSC, Pearson, Spearman, and
Kendall methods. These matrices were used to construct dissimilarity measures, which
were given as input to the hierarchical clustering method based on the complete
linkage algorithm. Once the hierarchy between the genes is obtained, the optimal
number of clusters is estimated with the dynamic tree cut approach of Langfelder et
al. (2008, 2016). The latter detects the clusters in the dendrogram based on the shape
of the branches. This method was shown by Langfelder et al. (2008) to have better
performance compared to the classical fixed height dendrogram cut methodology when
applied to protein-protein interaction network and gene expression data. In fact, the
dynamic tree cut algorithm is able to identify nested clusters, and to identify outliers.
Since tiny clusters of genes are to be avoided, the user can set the minimum number of
genes needed to create a cluster. In our experiments this number is set to 15.

Fig. 1 Number of clusters obtained with the Kendall, Pearson, RSC, and Spearman correlation-
based dissimilarity for both the Oxford (a) and TCGA (b) data sets.


The differences between the obtained clusters were investigated. Figure 1 shows 
that the estimated number of clusters is significantly lower when using the RSC- 
based dissimilarity. The reason for this is that both the contamination and the large 
concentration ratio of these data sets cause the emergence of a huge number of spu- 
rious small correlations. The RSC correlation matrix can filter out these artifacts, 
so that the resulting number of connected genes is greatly reduced. The strong dif- 
ference can also be seen in terms of dissimilarity between the optimal partitions. 
Partition dissimilarity is evaluated based on the Rand Index and the Normalized
Mutual Information (NMI).


Fig. 2 Heatmaps of the Rand Index and the NMI comparing clusterings obtained based on RSC,
Pearson, Spearman and Kendall correlation matrices. Panel (a) reports results for the Oxford data
set, panel (b) reports results for the TCGA data set.


Let $X = \{x_1, \ldots, x_n\}$ be a set of objects, and consider two partitions of $X$ to compare,
$Cl_1 = \{X^1_1, \ldots, X^1_{k_1}\}$ and $Cl_2 = \{X^2_1, \ldots, X^2_{k_2}\}$. The Rand index is defined
as follows:

$$R = \frac{a+b}{a+b+c+d} = \frac{a+b}{\binom{n}{2}},$$

where $a$ is the number of pairs of elements in $X$ that are in the same subset in $Cl_1$
and in the same subset in $Cl_2$, $b$ is the number of pairs of elements in $X$ that are
in different subsets in $Cl_1$ and in different subsets in $Cl_2$, $c$ is the number of pairs
of elements in $X$ that are in the same subset in $Cl_1$ and in different subsets in $Cl_2$, and
$d$ is the number of pairs of elements in $X$ that are in different subsets in $Cl_1$ and
in the same subset in $Cl_2$. The Rand Index satisfies $R \in [0, 1]$, with $R = 0$ meaning no
agreement between the partitions, and $R = 1$ meaning complete agreement between them.

The normalized mutual information (NMI) between the two partitions $Cl_1$ and
$Cl_2$ is defined as follows:

$$\mathrm{NMI} = \frac{I(Cl_1, Cl_2)}{H(Cl_1) + H(Cl_2)},$$

where $I(Cl_1, Cl_2)$ is the mutual information between the clusterings $Cl_1$ and $Cl_2$, and
$H(Cl_1)$ and $H(Cl_2)$ are the entropies of $Cl_1$ and $Cl_2$, respectively. $\mathrm{NMI} \in [0, 1]$, where
$\mathrm{NMI} = 0$ means that there is no dependence between the two clusterings, and
$\mathrm{NMI} = 1$ means that they are strongly dependent.
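
The pair counts $a$, $b$, $c$ and $d$ that enter the Rand index can be obtained by direct enumeration of all pairs of objects. The sketch below is a plain Python illustration, not the code used for the experiments; it assumes that each partition is given as a sequence of cluster labels (one label per object, at least two objects), and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_1, labels_2):
    """Rand index between two partitions given as label sequences,
    where labels_k[i] is the cluster of object i in partition k."""
    if len(labels_1) != len(labels_2):
        raise ValueError("both partitions must label the same objects")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_1)), 2):
        same_1 = labels_1[i] == labels_1[j]
        same_2 = labels_2[i] == labels_2[j]
        if same_1 and same_2:
            a += 1  # together in both partitions
        elif not same_1 and not same_2:
            b += 1  # separated in both partitions
        elif same_1:
            c += 1  # together in the first partition only
        else:
            d += 1  # together in the second partition only
    return (a + b) / (a + b + c + d)


# Hypothetical labels for six objects.
r = rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

The enumeration costs one comparison per pair of objects, which is adequate for a sketch; with thousands of genes a contingency-table formulation would be preferable.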


Again, Figure 2 shows that the clustering obtained using the RSC matrix is
dramatically different from those based on the competing methods. These strong
differences may lead to interesting biological conclusions. Here we do not attempt
to establish which method is better, since this would require in-depth
biological investigations. However, the analysis reported shows that taking into account
the sources of the extra noise in an estimation step that is instrumental to the final
clustering causes differences of unexpected magnitude.


4 Concluding remarks 


In this work, the influence of the correlation matrix used as dissimilarity measure
in the clustering of genes is investigated. A hierarchical clustering algorithm with a
dynamic tree cut approach was applied to two real gene expression data sets. The
RSC correlation matrix produced clusters that are different from
those obtained based on classical correlation measures. The study confirms that the
effects of data contamination and high-dimensionality in gene expression data
need to be considered carefully. We cannot establish which clustering is better: this
needs to be investigated in future research based on biological validation.
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Variable selection for (realistic) stochastic 
blockmodels 


Variable selection for stochastic blockmodels


Mirko Signorelli 


Abstract Stochastic blockmodels provide a convenient representation of relations
between communities of nodes in a network. However, they imply a notion of
stochastic equivalence that is often unrealistic for real networks, and they comprise
a large number of parameters that can make them hard to interpret. We discuss two
extensions of stochastic blockmodels, and a recently proposed variable selection
approach based on penalized inference, which allows one to infer a sparse reduced graph
summarizing relations between communities. We compare this approach with
maximum likelihood estimation on two datasets, on face-to-face interactions in a French
primary school and on bill cosponsorships in the Italian Parliament.

Abstract Although stochastic blockmodels allow a convenient representation of
the relations between groups of nodes in a network, they imply a notion
of stochastic equivalence that is often unrealistic in real networks, and they require
numerous parameters that often make them hard to interpret. This work deals with
two extensions of stochastic blockmodels, and with a variable selection method
based on the penalization of the likelihood function, which makes it possible to
derive a graphical representation of the relations between groups of nodes. This
approach is compared with the maximum likelihood estimator through two
applications, on a network of interactions in a French primary school
and on the cosponsorship of bills in the Chamber of Deputies.


Key words: adaptive lasso; network; penalized inference; reduced graph; stochas- 
tic blockmodel; variable selection. 
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1 Introduction 


There is a long tradition in the study of graphs and relational data, whose origins
can arguably be traced back to the seminal works of Moreno (1934) and Erdős and
Rényi (1959). For decades, however, the study of real networks was limited by the
difficulty of collecting comprehensive data on large and complex systems. At the turn of
the 21st century, network science received a sudden boost from many technological
advances that have facilitated the collection of relational data in a wide range of fields.
Examples include the advent of high throughput technologies in genetics and of
functional magnetic resonance imaging in neuroscience, as well as the development
of sensor-based measurements and the diffusion of social media in social network
analysis.

The increasing availability of data on real networks has fostered research on their 
focal properties. These include the famous “small-world property”, encapsulated in 
the idea of “six degrees of separation” between any two inhabitants of the Earth, 
and the idea that networks are scale free, i.e., that a few nodes in a network ac- 
count for most of the connections therein. A further commonly observed feature of 
real networks is the presence of groups of nodes (“communities”) that are highly 
connected to each other, and poorly connected to the rest of the network. This com- 
munity structure may be induced by observed attributes of the nodes, or it could 
be thought of as the result of an unknown latent factor. In this paper we focus on two
extensions of a priori stochastic blockmodels, a class of network models that allow
one to relate such community structures to observed attributes of the nodes¹. Although
stochastic blockmodels are a convenient way to represent relations between groups
of nodes in a network, they require a large number of parameters, which increases
quadratically with the number of groups. As a consequence, when a large
number of groups is considered, they typically yield cumbersome results that are hard
to interpret. Signorelli and Wit (2016) proposed to address this issue by
estimating stochastic blockmodels in a penalized likelihood setting. This makes it possible to perform
variable selection for stochastic blockmodels, to reduce model complexity and to
derive a sparse reduced graph that summarizes the most important interactions within
and between communities.

The paper is organized as follows. In Section 2 we discuss how the stochastic
blockmodel can be extended so as to incorporate information on the degrees of
nodes and on nodal or edge-specific covariates, and how to derive a reduced graph
that summarizes relations between communities. Section 3 briefly introduces the
variable selection approach proposed by Signorelli and Wit (2016). In Section 4, we
illustrate the proposed methodology with two examples, on face-to-face contacts in
a French primary school and on bill cosponsorship in the Italian Parliament.


¹ A related class of blockmodels is that of a posteriori stochastic blockmodels (Wasserman and
Anderson, 1987), whose aim is the detection of communities rather than the description of relations 
between known blocks of nodes. 
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2 Representing community structure with realistic stochastic 
blockmodels 


We consider an undirected graph $\mathcal{G} = (V, E)$, which features a set of edges $E \subseteq V \times V$
between a set of nodes or vertices $V = \{1, \ldots, n\}$. We denote by $A$ the (symmetric)
adjacency matrix of the graph, whose entries $a_{ij}$ are non-null if and only if an edge
between nodes $i$ and $j$ is present, and we assume absence of self-loops, i.e., $a_{ii} = 0$
$\forall i \in V$. Moreover, we distinguish binary graphs, where $a_{ij} \in \{0, 1\}$, from
edge-valued graphs, where $a_{ij} \in \mathbb{N}$, and we view each $a_{ij}$ as a draw from the random
variable $Y_{ij}$.


2.1 Stochastic blockmodel: definition and extensions 


A stochastic blockmodel assumes that a partition $\mathcal{P}$ of $V$ into $p$ blocks of nodes
$\{B_1, \ldots, B_p\}$ is available. According to the definition proposed by Holland et al.
(1983), a network model is a stochastic blockmodel if


• the random variables $Y_{ij}$ are independent;
• $Y_{ij}$ and $Y_{kl}$ are identically distributed if nodes $i, k$ belong to the same block $B_r$,
and nodes $j, l$ to the same block $B_s$.


This definition implies that every node within the same block is stochastically equiv- 
alent, to wit, that it is possible to swap any two nodes that are members of the same 
block without affecting the probability distribution of the graph. 

The assumption of stochastic equivalence within blocks represents a strong
limitation of stochastic blockmodels. For example, it entails that the expected degree
of nodes within a block is the same, whereas most real networks feature a strong
heterogeneity in the distribution of node degrees. This was noted already by Wang
and Wong (1987), who proposed to integrate the stochastic blockmodel with a set of
nodal fixed effects. For undirected binary graphs, their degree-corrected blockmodel
assumes that if $i \in B_r$ and $j \in B_s$, then $Y_{ij} \sim \mathrm{Bern}(\pi_{ij})$ and

$$\operatorname{logit} \pi_{ij} = \beta_0 + \alpha_i + \alpha_j + \phi_{rs}, \tag{1}$$

subject to the identifiability constraints $\sum_i \alpha_i = 0$ and $\sum_s \phi_{rs} = 0$ $\forall r \in \{1, \ldots, p\}$.
Here, a positive block-interaction effect $\phi_{rs}$ indicates that nodes in blocks $B_r$ and
$B_s$ tend to interact preferentially with each other. Note that Equation (1) breaks
the assumption of stochastic equivalence of nodes within a block and, thus, the
model proposed by Wang and Wong (1987) is not a stochastic blockmodel in the
sense of Holland et al. (1983). However, it allows a more realistic description of a
network with known block-structure: as a matter of fact, it takes into account both
nodal information on the popularity or productivity of each node ($\alpha_i$ and $\alpha_j$), and
information on the extent of interaction between pairs of blocks ($\phi_{rs}$).
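
To illustrate how Equation (1) breaks stochastic equivalence, the following sketch computes the edge probability implied by the logit link for two pairs of nodes that share the same pair of blocks but differ in one nodal effect. All parameter values are hypothetical, and the function name is an assumption introduced here.

```python
import math


def edge_probability(beta0, alpha_i, alpha_j, phi_rs):
    """Edge probability pi_ij implied by
    logit(pi_ij) = beta0 + alpha_i + alpha_j + phi_rs."""
    eta = beta0 + alpha_i + alpha_j + phi_rs
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical values: same blocks (same phi_rs), different nodal effect
# for node i, hence different edge probabilities.
p_popular = edge_probability(beta0=-2.0, alpha_i=1.0, alpha_j=0.5, phi_rs=0.8)
p_isolated = edge_probability(beta0=-2.0, alpha_i=-1.0, alpha_j=0.5, phi_rs=0.8)
```

Under a stochastic blockmodel in the sense of Holland et al. (1983) the two probabilities would coincide; here they differ only because of the nodal effects.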
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A further limitation of stochastic blockmodels is that they postulate that the for- 
mation of edges depends only on block membership of the nodes. However, often 
it is reasonable to imagine that other factors besides block membership can affect 
the process of edge formation. Signorelli and Wit (2016) proposed an extension to 
stochastic blockmodels that allows the formation of an edge to depend both on block 
memberships, and on a set of nodal or edge-specific covariates $x_{ij}$. They considered
the case of an undirected, edge-valued graph and viewed the formation of an edge
between $i \in B_r$ and $j \in B_s$ as the result of a Poisson process whose rate depends both
on blocks $B_r$ and $B_s$ and on the covariates $x_{ij}$. The resulting network model can be
estimated with a generalized linear model where $Y_{ij} \sim \mathrm{Poi}(\mu_{ij})$ and

$$\log \mu_{ij} = \beta_0 + x_{ij}\beta + \gamma_r + \gamma_s + \phi_{rs}, \tag{2}$$

subject to the identifiability conditions $\sum_r \gamma_r = 0$ and $\sum_s \phi_{rs} = 0$ $\forall r \in \{1, \ldots, p\}$.
Like model (1), model (2) also breaks the assumption of stochastic equivalence
within blocks. However, it differs from model (1) in two respects: it makes it possible to account
for factors other than group membership, and it replaces the nodal fixed effects $\alpha_i$
with block effects $\gamma_r$.


2.2 How to derive a reduced graph 


The focal point of a stochastic blockmodel and of its (more realistic) extensions out- 
lined above is their capacity to summarize a (potentially large) network by making 
some statements on the relations that exist between the blocks (Anderson, 1992). In 
particular, stochastic blockmodels make it possible to infer from the observed graph
$\mathcal{G}$ a reduced graph $\mathcal{G}_R = (\mathcal{P}, E_R)$ whose nodes are the blocks.

The reduced graph represents a synthetic way to visualize the relations that exist
between blocks in the network. Typically, it is employed to show which blocks
interact more with each other. For binary graphs, Anderson (1992) proposed to derive a
reduced graph from a stochastic blockmodel by setting a threshold on the predicted
probability $\pi_{rs}$ of observing an edge between nodes in blocks $B_r$ and $B_s$.
However, the reduced graph obtained with this procedure depends on the arbitrary
choice of the threshold and, furthermore, it might display a block as connected
to every other block just because its nodes have, on average, high degrees. Moreover,
this procedure does not directly generalize to the case of edge-valued graphs.

To overcome these problems, Signorelli and Wit (2016) derive the reduced graph
in a different way, drawing an edge between two blocks $B_r$ and $B_s$ if the estimate
$\hat\phi_{rs}$ of the corresponding block-interaction parameter $\phi_{rs}$ is positive. This approach is
coherent with the parametrizations employed in models (1) and (2), where a positive
$\phi_{rs}$ entails evidence of attraction between $B_r$ and $B_s$. Thus, the resulting reduced
graph will display those pairs of blocks whose nodes tend to interact more with each
other. The reduced graphs presented in Section 4 are obtained with this method.
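
The sign rule just described can be sketched in a few lines of Python. The dictionary layout, the block labels and the numerical values below are hypothetical and serve only to illustrate the rule; they are not estimates from the applications.

```python
def reduced_graph(phi_hat):
    """Edges of the reduced graph: block pairs (r, s) whose estimated
    block-interaction parameter is strictly positive.

    `phi_hat` maps a pair (r, s), listed once per unordered pair, to its
    estimate; a pair (r, r) yields a self-loop."""
    return sorted(pair for pair, estimate in phi_hat.items() if estimate > 0)


# Hypothetical estimates for three blocks A, B and C.
phi_hat = {
    ("A", "A"): 0.9, ("A", "B"): -0.5, ("A", "C"): -0.4,
    ("B", "B"): 0.3, ("B", "C"): 0.2, ("C", "C"): 0.2,
}
edges = reduced_graph(phi_hat)
```

Because the rule depends only on the sign of each estimate, no threshold has to be chosen, and estimates that are exactly zero produce no edge.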




3 Variable selection for stochastic blockmodels 


The description of relations between pairs of blocks provided by stochastic
blockmodels requires the use of a rather large number of parameters. This is
necessary in order to model each interaction between blocks $(B_r, B_s)$, $s \geq r \in \{1, \ldots, p\}$.
In particular, model (1) includes $q_1 = n + p(p - 1)/2$ parameters, and model (2)
$q_2 = \dim(\beta) + p(p + 1)/2$. As we will show in Section 4, when many blocks are
considered ($p \geq 10$) this often yields reduced graphs with a large number of links that
are cumbersome to interpret.
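
As a quick arithmetic check of these formulas, the sketch below evaluates the number of block pairs and the two parameter counts. The function names are introduced here for illustration; the values p = 21 and p = 10 are the numbers of blocks used in the two applications of Section 4, and taking dim(β) = 6 (the six covariate rows of Table 1) is an assumption of this example.

```python
def n_block_interactions(p):
    """Number of block pairs (B_r, B_s) with s >= r, self-pairs included."""
    return p * (p + 1) // 2


def n_params_model_1(n, p):
    """q1 = n + p(p - 1)/2 for the degree-corrected model (1)."""
    return n + p * (p - 1) // 2


def n_params_model_2(dim_beta, p):
    """q2 = dim(beta) + p(p + 1)/2 for the covariate model (2)."""
    return dim_beta + p * (p + 1) // 2


pairs_school = n_block_interactions(21)      # 21 blocks in the school network
pairs_parliament = n_block_interactions(10)  # 10 parliamentary groups
q2_parliament = n_params_model_2(dim_beta=6, p=10)
```

The counts of 231 and 55 block pairs agree with the totals of positive, null and negative estimates reported in Section 4 (86 + 145 and 29 + 26, respectively).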

In a study on collaborations between Italian political parties, Signorelli and
Wit (2016) analysed bill cosponsorship networks in the Chamber of Deputies with
model (2) and observed that although positive and negative estimates $\hat\phi_{rs}$
respectively entail collaboration and repulsion between Deputies in parties $B_r$ and $B_s$, it
is also possible to imagine a situation of indifference between collaboration and
repulsion for some pairs of parties. This indifference directly corresponds to $\phi_{rs} = 0$
in models (1) and (2). However, with maximum likelihood estimation it is highly
unlikely that any of the point estimates $\hat\phi_{rs}$ will be exactly zero. For this reason,
they advocated the penalization of the block-interaction terms $\phi_{rs}$ (as well as of the
covariate vector $\beta$ in Equation (2)) and employed the adaptive lasso (Zou, 2006) to
estimate their model.

This penalized inference approach yields two advantages: on the one hand, it
makes it possible to distinguish situations of indifference between blocks from collaborations
or repulsions; on the other hand, it is capable of reducing the complexity of the inferred
model by shrinking some of its parameters to 0. As a result, it enables the inference of a
sparse reduced graph, which is typically easier to interpret than the one based on
maximum likelihood estimation.
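
To give an intuition of why the adaptive lasso sets some parameters exactly to zero, the sketch below uses a deliberately simplified setting: a linear model with orthonormal design, where the adaptive lasso has a closed form (each unpenalized estimate $b$ is soft-thresholded at $\lambda / |b|^{\gamma}$). This is not the penalized generalized linear model estimated by Signorelli and Wit (2016); the function name, the value of $\lambda$ and the initial estimates are assumptions chosen for illustration.

```python
import math


def adaptive_lasso_orthonormal(initial, lam, gamma=1.0):
    """Adaptive lasso solution when the design is orthonormal.

    Each initial (unpenalized) estimate b is soft-thresholded at
    lam * w, with adaptive weight w = 1 / |b| ** gamma."""
    shrunk = []
    for b in initial:
        if b == 0.0:
            shrunk.append(0.0)
            continue
        weight = 1.0 / abs(b) ** gamma
        magnitude = max(abs(b) - lam * weight, 0.0)
        shrunk.append(math.copysign(magnitude, b) if magnitude > 0 else 0.0)
    return shrunk


# Hypothetical initial estimates: two sizeable effects and one tiny one.
estimates = adaptive_lasso_orthonormal([0.70, -0.011, 0.25], lam=0.01)
```

The tiny estimate receives a very large weight and is set exactly to zero, whereas the sizeable ones are only slightly shrunk; this is the mechanism that separates indifference from collaboration or repulsion.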

We remark that because of the identifiability conditions that ought to be
imposed in models (1) and (2), $p$ block-interaction parameters do not directly appear
in the models and, thus, they cannot be penalized. Given that the parameters for
interactions within each block, $\phi_{rr}$, are anyway likely to be positive, we substitute
$\phi_{rr} = -\sum_{s \neq r} \phi_{rs}$ for every $r \in \{1, \ldots, p\}$ in (1) and (2). By doing so, we penalize each
block-interaction parameter $\phi_{rs}$ ($r \neq s$), and derive each $\phi_{rr}$ from the constraints.
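
The substitution $\phi_{rr} = -\sum_{s \neq r} \phi_{rs}$ can be made concrete with a short sketch that recovers the within-block parameters from the off-diagonal ones. The data structure, block labels and values are hypothetical.

```python
def within_block_effects(off_diagonal, blocks):
    """Recover phi_rr = -sum_{s != r} phi_rs from the off-diagonal
    block-interaction parameters.

    `off_diagonal` maps frozenset({r, s}) with r != s to phi_rs."""
    diagonal = {}
    for r in blocks:
        total = sum(
            off_diagonal[frozenset((r, s))] for s in blocks if s != r
        )
        diagonal[r] = -total
    return diagonal


# Hypothetical penalized estimates for three blocks; one is shrunk to 0.
off_diagonal = {
    frozenset(("A", "B")): -0.5,
    frozenset(("A", "C")): -0.4,
    frozenset(("B", "C")): 0.0,
}
phi_rr = within_block_effects(off_diagonal, blocks=["A", "B", "C"])
```

With these values every row of the symmetric matrix of block-interaction parameters sums to zero, as required by the identifiability constraint.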

In the next Section we consider two examples of penalized inference for stochas- 
tic blockmodels, and carry out a comparison of this approach with the one based on 
maximum likelihood. 


4 Applications 


4.1 Face-to-face contacts in a French primary school 


We consider data on face-to-face interactions in a French primary school collected
by Stehlé et al. (2011). The study, which lasted 2 days, employed sensors to detect
face-to-face interactions between students and teachers that lasted at least 20
seconds. Here, we focus on the interactions measured in the first day and consider a
binary graph whose nodes are students and teachers, and where an edge between
two nodes indicates that at least one interaction between them was recorded during
the day.

The school comprises 10 classes (2 for each level). The available information for 
each node is its status (student or teacher); furthermore, for students also class and 
gender are known. Thus, we partition the nodes into 21 blocks: 20 blocks partition 
students according to their class and gender, and the last one contains teachers. 

We employ model (1) to study the pattern of interactions among the blocks, and 
compare the reduced graphs that can be derived by employing maximum likelihood, 
and the penalized likelihood estimation procedure described in Section 3. 

Maximum likelihood estimation results in 86 positive and 145 negative
estimates of the block-interaction parameters. As a result, the corresponding reduced graph in Figure
1 displays a large number of interactions between the blocks. Penalized likelihood
estimation, instead, shrinks 88 block-interaction parameters to 0, resulting in 52
positive and 91 negative parameter estimates $\hat\phi_{rs}$. A direct consequence of this is that
the reduced graph displaying interactions between blocks is now more readable. In
particular, the presence of self-loops indicates that members within each block
interact frequently with their peers. Furthermore, a link is present between male and
female students within each class. Whereas students in their fifth grade also interact
across classes in their same grade (5A and 5B) irrespective of gender, the pattern of
interaction between the two third grade classes (3A and 3B) seems to be affected
by gender identity (males in 3A interact with males in 3B, and females in 3A with
females in 3B). Instead, we do not find any interaction between the two first grade
classes or between the two fourth grade classes (1A-1B and 4A-4B, respectively).


Fig. 1 Comparison of reduced graphs based on maximum likelihood and penalized likelihood
inference, displaying interactions between groups of students (and teachers) in a French primary
school. Panels: reduced graph based on the maximum likelihood estimator; reduced graph based
on the adaptive lasso estimator. Node colors denote grades and their shapes distinguish female
(circle) from male (square) students. The label of each block indicates the grade (1-5), the
section (A or B) and the gender (F or M) of students. The white circular node indicates the
block of teachers.


4.2 Bill cosponsorship in the Italian Parliament 


Signorelli and Wit (2016) employed data on bill cosponsorships to reconstruct the 
pattern of collaborations between Italian political parties in the Chamber of Deputies 
from 2001 to 2015. Here we focus our attention on the bill cosponsorship network 
for the first part of the XVII legislature (2013 - 2015) and make a comparison be- 
tween maximum likelihood and penalized likelihood inference. 

We define a bill cosponsorship network where a weighted undirected edge is 
present between two deputies if they have cosponsored together at least one bill. 
Edge weights represent the number of bills that each pair of deputies has cospon- 
sored. During the XVII legislature, 10 parliamentary groups are represented in the 
Chamber. Those groups form the blocks in model (2), where we furthermore con- 
sider covariates for gender, age and seniority of deputies, besides a dummy vari- 
able that indicates whether two deputies have been elected in the same electoral 
constituency. In the penalized model, we penalize each of the covariates and the 
block-interaction terms, and we employ the adaptive lasso for estimation. 

Table 1 compares the results for the intercept $\beta_0$ and the parameter vector $\beta$.
Here, the only (slight) difference is that the parameter for age difference is shrunk
to 0 with the adaptive lasso. The other variables indicate that female and senior
deputies are more active in cosponsorships, and that geographic proximity also
increases the tendency to collaborate.
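
Since model (2) uses a log link, exponentiating a coefficient gives the multiplicative change in the expected number of cosponsored bills relative to the reference category, all other terms being equal. The sketch below applies this to the maximum likelihood estimates of the categorical covariates reported in Table 1; the variable names are introduced here for illustration.

```python
import math

# Maximum likelihood estimates reported in Table 1 (log scale).
ml_estimates = {
    "Female-male interaction": 0.233,
    "Female-female interaction": 0.659,
    "Same electoral constituency": 0.550,
    "Junior-senior interaction": 0.253,
    "Senior-senior interaction": 0.700,
}

# exp(coefficient) is the multiplicative change in the expected number of
# cosponsored bills relative to the reference category, other terms fixed.
rate_ratios = {name: math.exp(b) for name, b in ml_estimates.items()}
```

For instance, the senior-senior coefficient of 0.700 corresponds to an expected number of cosponsorships roughly twice that of a junior-junior pair with otherwise identical characteristics.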

The main difference between the two approaches lies in the estimation of the
block-interaction parameters $\phi_{rs}$. Maximum likelihood yields 29 positive and 26
negative estimates of the block-interaction parameters; the adaptive lasso, instead,
shrinks 16 of those parameters to 0, resulting in 21 positive, 16 null and 18
negative estimates. Once more, the reduced graph of collaborations based on
maximum likelihood is rather cumbersome to interpret, whereas the one based on the
adaptive lasso is more readable. In particular, the latter points out collaborations
within each party, between the 4 right-wing parties (orange), between three parties
('Centro Democratico', 'Scelta Civica' and 'Area Popolare') that belong to different
coalitions, and between the two main left-wing parties; it also shows that deputies in the 'mixed
group' tend to collaborate with left-wing parties and with the 'Movimento 5 Stelle'.


Table 1 Comparison of maximum likelihood and adaptive lasso estimators for the parameter
vector $\beta$ in model (2). The reference categories are interactions between two male deputies (male-male)
for gender effects, and between two junior deputies (junior-junior) for seniority.


Covariate                      Maximum likelihood estimate   Adaptive lasso estimate

Intercept ($\beta_0$)          -3.83                         -3.86
Female-male interaction         0.233                         0.210
Female-female interaction       0.659                         0.634
Same electoral constituency     0.550                         0.554
Age difference                 -0.011                         0
Junior-senior interaction       0.253                         0.234
Senior-senior interaction       0.700                         0.712


Fig. 2 Comparison of reduced graphs based on maximum likelihood and penalized likelihood
inference, displaying collaborations between Italian political parties. Panels: reduced graph based
on the maximum likelihood estimator; reduced graph based on the adaptive lasso estimator. Node
size is proportional to group productivity. The colour of nodes is light blue for left-wing parties,
orange for right-wing ones, yellow for 'Scelta Civica', green for 'Movimento 5 Stelle' and white
for the mixed group.
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Detection of spatio-temporal local structure on 
seismic data 


Detection of spatio-temporal local structures in seismic data


Marianna Siino, Francisco J. Rodriguez-Cortés, Jorge Mateu and Giada Adelfio 


Abstract For the description of the seismicity of an area, the comparison between
local features of background and induced events could be a new perspective of
research. In spatio-temporal point processes, local second-order statistics provide
information on the relationships between each event and its nearby events. In this paper, we use
a test based on local indicators of spatio-temporal association (LISTA functions) to
identify different local structures when comparing these two sets of events. We
present a simulation study on the test and show the main results of its application
to Greek earthquake data.

Abstract This work proposes a new perspective of analysis for describing
the seismicity of an area, based on comparing the local features
of background and induced events. In the analysis of spatio-temporal point
processes, local second-order statistics describe
the relationships between each event and its nearest neighbours. To identify differences
between the two previously identified sets of events, we use a test based on
local indicators of spatio-temporal association, called LISTA functions. We
present a simulation study and the main results obtained by applying the methodology
to seismic data from Greece.


Key words: earthquakes; local indicators of spatio-temporal association; second- 
order product density function 


Marianna Siino 
Dipartimento di Scienze Economiche, Aziendali e Statistiche, Università degli Studi di Palermo, 
Palermo, Italy 


Francisco J. Rodriguez-Cortés 
Department of Mathematics, Universitat Jaume I, Castellon, Spain 


Jorge Mateu
Department of Mathematics, Universitat Jaume I, Castellon, Spain


Giada Adelfio
Dipartimento di Scienze Economiche, Aziendali e Statistiche, Università degli Studi di Palermo,
Palermo, Italy, e-mail: giada.adelfio@unipa.it




Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 




1 Introduction 


In an observed area, earthquake events can be considered as a realization of a marked 
space-time point process, where the magnitude is the mark, and a point is identified 
by its geographical coordinates and time of occurrence (Illian et al, 2008). Generally,
the description of seismic events requires more complex models
than the stationary Poisson process, since these events are characterised by a clustering structure.
Therefore, spatio-temporal cluster analysis has a relevant role in the comprehension 
of seismic processes. 

Commonly, global spatio-temporal second-order summary statistics (such as the 
K- and pair-correlation functions) are used to detect deviations from the Poisson 
assumption (Gabriel and Diggle, 2009). These tools play a fundamental role in the 
phase of descriptive analysis, in model validation and for testing procedures giving 
global information about a given point pattern. An interesting question is
whether the same conclusions hold locally: for example, whether in testing
procedures the pattern behaves differently in some subregions of the spatio-temporal window,
which would identify specific regions where the null hypothesis is rejected.

Anselin (1995) proposed the idea of considering individual contributions of a 
global estimator as a measure of clustering under the name of Local Indicators of 
Spatial Association (LISA). In spatial point processes, Cressie and Collins (2001) 
propose a local product density function developing theoretical properties, namely 
first- and second-order moments, of these functions. Some applications of LISA 
functions are in Mateu et al (2007) and Moraga and Montes (2011). Rodriguez- 
Cortés (2014) and Siino et al (2016b) extend the concept of LISA function to the 
spatio-temporal point pattern context defining the LISTA functions. A brief sum- 
mary on this methodology is in Section 2. Moreover, Siino et al (2016b) develop 
a testing procedure for the local structure comparing spatio-temporal point pat- 
terns based on Local indicators of spatio-temporal association (LISTA) functions 
described in Section 3. A simulation study is performed to illustrate that the test 
proposed has the prescribed size (Section 4). For the analysis in Section 5, we
consider earthquakes that occurred in the Hellenic area between 2005 and 2014. We aim to
detect which triggered events have a significantly different local cluster structure with
respect to the underlying process, represented by the background events, and to link the
results with the geological information available for the study area.


2 Methodology 


We consider a spatio-temporal point process with no multiple points as a random
countable subset $X$ of $\mathbb{R}^2 \times \mathbb{R}$, where for a point $(u,t) \in X$, $u \in \mathbb{R}^2$ is the spatial
location and $t \in \mathbb{R}$ is the time of occurrence. In practice, an observed spatio-temporal
pattern is a finite set $\{(u_i, t_i)\}_{i=1}^{n}$ of distinct points within a bounded spatio-temporal
region $W \times T \subset \mathbb{R}^2 \times \mathbb{R}$, where usually $W$ is a polygon with area $|W| > 0$ and $T$ a
single closed interval with length $|T| > 0$. Considering a bounded spatio-temporal
region $A \subseteq W \times T$, $Y(A)$ denotes the number of the events of the process falling in
$A$. The intensity of a process is defined as (Diggle, 2013)

$$\rho(u,t) = \lim_{|du \times dt| \to 0} \frac{E[Y(du \times dt)]}{|du \times dt|},$$

where $du \times dt$ is a spatio-temporal region around the point $(u,t)$, $|du|$ is the area of
the spatial region, $|dt|$ is the length of the time interval and $E[Y(du \times dt)]$ denotes the
expected number of events in the infinitesimal spatio-temporal region. The process
is called homogeneous or stationary when the intensity is constant, $\rho(u,t) = \rho$ for
all $(u,t) \in W \times T$.

When the interest is in describing the spatio-temporal variability and correlations
between points of a pattern, we have to consider second-order measures, such as the
product density $\rho^{(2)}(\cdot,\cdot)$. This quantity provides an interpretable measure of the
spatio-temporal dependence structure and it is defined as

$$\rho^{(2)}((u_i,t_i),(u_j,t_j)) = \lim_{|du_i \times dt_i|,\, |du_j \times dt_j| \to 0} \frac{E[Y(du_i \times dt_i)\, Y(du_j \times dt_j)]}{|du_i \times dt_i|\, |du_j \times dt_j|}, \tag{1}$$

where $du_i \times dt_i$ and $du_j \times dt_j$ are small cylinders around two distinct points $(u_i, t_i)$
and $(u_j, t_j)$.

Under the stationary case, and ignoring edge-effects, a global naive non-parametric
kernel estimator for $\rho^{(2)}(r,h)$ in (1) (Rodríguez-Cortés, 2014) is given by

$$\hat\rho^{(2)}_{\epsilon,\delta}(r,h) = \frac{1}{4\pi r |B|} \sum_{i=1}^{n} \sum_{j \neq i} \kappa_{\epsilon\delta}(\|u_i - u_j\| - r, |t_i - t_j| - h), \tag{2}$$

where the sum is over all pairs $(u_i,t_i) \neq (u_j,t_j)$ of the data points, $B = W \times T$,
$r > \epsilon > 0$ and $h > \delta > 0$. The kernel function $\kappa_{\epsilon\delta}$ has a multiplicative form
$\kappa_{\epsilon\delta}(\|u_i - u_j\| - r, |t_i - t_j| - h) = \kappa_{1\epsilon}(\|u_i - u_j\| - r)\, \kappa_{2\delta}(|t_i - t_j| - h)$, where $\kappa_{1\epsilon}$ and
$\kappa_{2\delta}$ are kernel functions with bandwidths $\epsilon$ and $\delta$, respectively. For an
approximately unbiased edge-corrected estimator for the spatio-temporal product density
see Rodríguez-Cortés (2014). The R package stpp (Gabriel et al, 2013)
implements the main code for the computation of the estimator in (2).

Considering the spatio-temporal product density in (1), its local version is denoted
by $\rho^{(2)i}(\cdot,\cdot)$. Rodríguez-Cortés (2014) extends the operational definition of
local indicator introduced by Anselin (1995): for fixed $r$ and $h$, it holds that

$$\hat{\rho}^{(2)}_{\epsilon,\delta}(r,h) = \frac{1}{n-1} \sum_{i=1}^{n} \hat{\rho}^{(2)i}_{\epsilon,\delta}(r,h). \qquad (3)$$

An unbiased edge-corrected kernel-based estimator for $\rho^{(2)i}(r,h)$ is given by

$$\hat{\rho}^{(2)i}_{\epsilon,\delta}(r,h) = \frac{n-1}{4\pi r |B|} \sum_{j \neq i} \frac{\kappa_{\epsilon\delta}(\|u_i-u_j\|-r, |t_i-t_j|-h)}{w(u_i,u_j)\,w(t_i,t_j)}, \qquad (4)$$

with $r > \epsilon > 0$, $h > \delta > 0$, for $(u_i,t_i) \in W \times T$, $i = 1,\ldots,n$, where $w(u_i,u_j)$ and $w(t_i,t_j)$
are the edge-effect factors. For formal theoretical details on the LISTA functions see
Rodríguez-Cortés (2014) and for the implementation the function LISTAfunct in
the GitHub repository pdLISTA (Rodríguez-Cortés, 2016).


3 Testing procedure 


The test procedure is an extension of the test proposed in Moraga and Montes (2011)
into the spatio-temporal context. It detects differences in the local structure of two
given spatio-temporal point patterns $X$ and $Z$. We test the null hypothesis of no
difference in the spatio-temporal local structure of $X$ and $Z$ with respect to the $i$-th
point $(u_i,t_i) \in X$, where the numbers of points in the two patterns are respectively
$N(X) = n$ and $N(Z) = m$. The steps of the testing procedure are the following.


1. For each point $(u_i,t_i) \in X$, for $i = 1,\ldots,n$, the LISTA function $\hat{\rho}^{(2)i}_{\epsilon,\delta}(r,h)$ is
estimated.

2. Secondly, for each fixed point $(u_i,t_i) \in X$, $k$ point patterns are generated under
the null hypothesis. For each fixed point $(u_i,t_i)$, $k$ local spatio-temporal product
density surfaces are estimated, $\hat{\rho}^{(2)i}_{q,\epsilon,\delta}(r,h)$ for $q = 1,\ldots,k$. They are summarised
in terms of the average surface, denoted by $\bar{\rho}^{(2)i}_{H_0}(r,h)$.

3. Based on the previous quantities, the following statistic is considered

$$T^i = \int_0^{h_0} \int_0^{r_0} \left( \hat{\rho}^{(2)i}_{\epsilon,\delta}(r,h) - \bar{\rho}^{(2)i}_{H_0}(r,h) \right)^2 dr\, dh, \qquad (5)$$

where $r_0$ and $h_0$ are chosen using Diggle's rule (Diggle, 2013).

4. The theoretical distribution of our statistic under the null hypothesis is not
known, so we rely on simulation-based empirical distributions. Fixing a point
$(u_i,t_i) \in X$, the estimated value of the statistic is compared with the empirical
distribution of the $k$ values $T^i_q$, with $q = 1,\ldots,k$, that are obtained by computing
the statistic between the $q$-th generated LISTA surface under the null hypothesis
and the sample mean function. The p-value of $T^i$ is the ratio
$p^i = \sum_{q=1}^{k} I(T^i_q > T^i)/k$. The null hypothesis is rejected if $p^i < \alpha$, where $\alpha$ is the
type I error.
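
The statistic in (5) and the Monte Carlo p-value of step 4 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the LISTA surfaces are assumed to be already evaluated on a regular (r, h) grid, the double integral is approximated by a Riemann sum, and all function names are hypothetical.

```python
def squared_distance(surface, reference, dr, dh):
    """Riemann-sum approximation of the double integral in (5).

    surface and reference are lists of rows (one row per h value,
    one column per r value); dr and dh are the grid spacings.
    """
    total = 0.0
    for row, ref_row in zip(surface, reference):
        for value, ref_value in zip(row, ref_row):
            total += (value - ref_value) ** 2
    return total * dr * dh


def average_surface(surfaces):
    """Pointwise average of the k simulated surfaces (step 2)."""
    k = len(surfaces)
    n_rows, n_cols = len(surfaces[0]), len(surfaces[0][0])
    return [[sum(s[a][b] for s in surfaces) / k for b in range(n_cols)]
            for a in range(n_rows)]


def local_p_value(observed, simulated, dr, dh):
    """Monte Carlo p-value of step 4 for one point of X."""
    avg = average_surface(simulated)
    t_obs = squared_distance(observed, avg, dr, dh)
    # Each simulated surface is compared with the same average surface.
    t_sim = [squared_distance(s, avg, dr, dh) for s in simulated]
    return sum(t > t_obs for t in t_sim) / len(t_sim)
```

A point is then flagged as locally different when the returned p-value falls below the chosen level, for instance 0.05.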


4 Simulation study 


A simulation study with some scenarios is carried out to assess the performances in 
terms of type I error of the test introduced in the previous section. 




The patterns are generated in the unit cube, $W \times T = [0,1]^2 \times [0,1]$, varying
the type of process (Poisson, Poisson cluster). We also consider $E[N(W \times T)] =
n + m \in \{150, 300\}$ for $X \cup Z$. Under the null hypothesis, a pattern is generated with
expected number of points equal to $n + m$, and the points are randomly associated
to the pattern $X$ or $Z$ such that the number of points for the two sets is equal, and
the test is computed for all the points belonging to $X$. For each point, the number of
permutations is equal to $k = 99$.

The spatio-temporal Poisson point patterns are generated using the function rpp
in the R package stpp. The Poisson cluster processes are simulated using rpcp
in stpp. Given the values of $n + m$ and the dispersion parameters, we control the
degree of clustering by changing the expected number of parents ($n_p = \{5, 10\}$) and
the number of offspring points with respect to each parent.

For each of the resulting scenarios under the null hypothesis, 100 pairs of patterns 
of (X,Z) are generated, and the type I error probability is defined as the proportion 
of points belonging to X for which the null hypothesis is rejected considering a fixed 
nominal value of $\alpha$. Table 1 presents the average and the variance of the p-values
under $H_0$ with the rejection rate for $\alpha = 0.05$. The statistical test exhibits acceptable
empirical rejection rates for the several scenarios. There are no remarkable differ- 
ences in the results when changing the intensity, the type of the process and the 
degree of clustering. 


Scenarios   Dispersion               n+m   n_p   ε       δ       T^i
X     Z     parameters                                           Rej.    Mean    Var.
P     P     -                        150   -     0.134   0.080   0.057   0.490   0.087
P     P     -                        300   -     0.111   0.069   0.055   0.487   0.086
PC    PC    {h,r} = (0.26;0.13)      150   5     0.094   0.069   0.053   0.500   0.086
PC    PC    {h,r} = (0.26;0.13)      150   15    0.114   0.074   0.048   0.493   0.084
PC    PC    {h,r} = (0.26;0.13)      300   5     0.082   0.061   0.047   0.500   0.085
PC    PC    {h,r} = (0.26;0.13)      300   15    0.095   0.065   0.047   0.511   0.085

Table 1: Rejection rates (Rej.) at α = 0.05, the mean and the variance (Var.) of the p-values for the
statistic. The spatio-temporal models considered are homogeneous Poisson point processes (P) and
Poisson cluster point processes (PC). Dispersion parameters for the PC model are given in the
table, n+m is the total expected number of points for X ∪ Z, and n_p is the expected number of
parents for the PC model. For each scenario, 100 simulations are considered; ε and δ are the
bandwidths in space and time, respectively.


5 Application 


In the seismological context, a background event refers to an earthquake that has
not been triggered by another and that might be related to changes in the tectonic
field. On the other hand, triggered events are thought to have been caused by a
previous earthquake. Globally, these two sets of events present different spatio-temporal
interaction structures; however, it can be of interest to compare them focusing
on a local scale.

In this application, we consider earthquakes that occurred in the Greek area
between 2005 and 2014 with a magnitude greater than 4 (Figure 1a), for a total
number of 1105 events. The complex spatial multiscale structure of these data has
been analysed in Siino et al (2016a).

The earthquakes are classified into background and induced events using a
declustering procedure: each event is assigned a probability of being an independent
event, obtained from an algorithm for the estimation of the Epidemic Type
Aftershock Sequence (ETAS) model (Ogata, 1988). We fitted the model using the
R package etasFLP (Chiodi and Adelfio, 2014) based on the method developed in
Adelfio and Chiodi (2015). We use the final probabilities provided in the last step of
the iteration procedure to classify the events with a magnitude greater than 4 into the
two groups, obtaining 580 background events and 525 triggered events (Figure 1a).

Considering the two clusters of independent and induced earthquakes, we aim to
answer the following research questions: Is there a different global structure between
the two point patterns? Which triggered events have a significantly different
local structure with respect to the underlying process (background events)? Is there
any geological justification for the identified clusters?

As expected, the estimated spatio-temporal product density of the spontaneous
events does not show any particular behaviour. On the other hand, for the induced
seismicity, there is spatio-temporal clustering at temporal distances below about
300 days and spatial distances below about 65 kilometres (Figure 1b).
However, we further aim to detect whether different conclusions can be obtained by
focusing on a local scale and detecting the spatio-temporal clusters.


[Figure 1, panels: (a) Spatio-temporal background (▲) and induced (●) events; (b) Estimated global
product density of background and induced events; (c) Spatial density of the significant induced
events and the seismic sources]

Fig. 1: (1a) Scatterplot of the spatio-temporal earthquake data classified in background events
(triangles) and induced events (points) according to the procedure of declustering using the ETAS
model. (1b) Estimated global product density for the background events (black surface), and
induced ones (grey surface) with bandwidths ε = 30.44 km and δ = 44.14 days. (1c) Image plot in
space of the significant induced events and seismogenetic sources.

We apply the testing procedure of Section 3. The point pattern Z is represented by
the background events and the events in X are the triggered ones. Representing
the significant points in space allows us to interpret them in relation
to the geological information available in the study area (Figure 1c). We can identify
some areas in which the induced events (with a magnitude greater than 4)
differ in terms of spatio-temporal local structure from the background seismicity:
the islands of Kefalonia and Zakynthos and the Samos area (East Aegean Sea). The
different behaviour is due to their specific geological characteristics and, in particular,
to a higher degree of fracturing of their seismogenetic volumes. These results confirm
our idea that the observed seismicity is generated by a complex model, characterised
by spatio-temporal interaction, with events happening at several scales,
and with spatial inhomogeneity related to the geological information available in
the study area.

6 Final remarks 


We deal with a non-parametric testing approach for spatio-temporal point processes,
in order to compare the local structure of two spatio-temporal point patterns (say X
and Z). The statistic used leads to an approximately valid test and the results in terms of
type I error are reasonably good. Using this test, we compare background
and induced seismicity with a magnitude greater than 4 in the Greek area.
It seems that the sequences of events that differ strongly from the underlying
process are located in specific regions of the study window.

As a possible future development, we may consider further simulation scenarios 
to assess the power of the test. Moreover, it could be interesting to define other lo- 
cal tests based on the LISTA surfaces, changing what is postulated under the null 
hypothesis. With the analysis of the LISTA surfaces for a given point pattern, we 
can explore how individual points are related to their neighbouring events, cluster- 
ing surfaces in order to classify points with similar spatio-temporal local structure. 
Moreover, we could develop a diagnostic tool based on the LISTA functions com- 
puting a weighted version of them by the inverse of the intensity function, looking 
for points with a more relevant contribution to the global summary statistics. 
Bayesian Mixture Models for the Detection of
High-Energy Astronomical Sources 


Modelli Mistura Bayesiani per la Rilevazione di Sorgenti 
Astronomiche ad Alta Energia 


A. Sottosanti, D. Bastieri, A. R. Brazzale 


Abstract The search for gamma-ray sources in the extra-galactic space is one of
the main targets of the Fermi telescope project, which aims to identify and study
the nature of high energy phenomena in the universe. Starting from a collection of
photons, we perform an unsupervised analysis using a Bayesian mixture model with
an unknown number of components to determine the number of gamma-ray sources
in the map. The parameters of the model are estimated using a reversible jump
MCMC algorithm. We finally propose a new method which exploits the distributions
of both the weights of the mixture components and the energy spectra to qualify the
nature of each cluster.

Abstract La rilevazione di sorgenti gamma nello spazio extra-galattico è uno dei 
principali obiettivi del telescopio Fermi, nato con l’intento di studiare la diversa 
natura dei fenomeni ad alta energia. Partendo da un insieme di fotoni, viene pro- 
posta un’analisi non supervisionata tramite un modello mistura bayesiano con un 
numero finito e non noto di componenti, al fine di individuare il numero di sorgenti 
luminose in una mappa. Per stimare i parametri del modello, viene proposto un 
algoritmo reversible jump MCMC. Viene infine discussa una nuova procedura per 
individuare la vera natura dei gruppi attraverso l’analisi dei pesi di mistura e degli 
spettri di energia. 


Key words: Astrostatistics, Bayesian Statistics, Finite Mixture Model, MCMC 
1 Introduction


The detection of astronomical sources is an interdisciplinary field which includes 
both statistical and astronomical methods. This work analyzes the gamma rays de- 
tected by Fermi LAT in the energy window 1 to 300 GeV, grouped into rectangular 
bins using galactic coordinates. The principal goals are: i) to determine the number 
of overlapping sources; ii) to measure their intensities; iii) to pool the individual 
counts into clusters. 

In this paper, we present a statistical model which uses the direction of photons
to identify the coordinates of sources in the analyzed extra-galactic space. Instead
of using the Pixel-by-Pixel approach discussed in [3], we implement a Bayesian
algorithm which simultaneously estimates the number of sources in the map, their
coordinates and their intensities. We also extend [2] to the case where the background
contamination cannot be assumed to be uniformly distributed over the entire map.


1.1 Fermi LAT data 


We start from a collection of γ-ray photons detected by the LAT telescope on board
the Fermi satellite, which is designed to record high energy particles. The map
represented in Figure 1 is plotted in galactic coordinates. The bright red band in the
middle corresponds to the Milky Way, and the flare at its center indicates
the presence of a black hole. This region is called galactic space, while the
blue part of the map is the extra-galactic space.

One of the main goals of the Fermi satellite is to detect new γ-ray sources in
the extra-galactic space, such as active galactic nuclei (AGN), or in our own galaxy
such as supernova remnants (SNR) and pulsar wind nebulae (PWN).

These sources are typically point-like, but there are also diffuse sources like the
so-called isotropic diffuse gamma-ray background (IGRB) that, as its name implies,
is uniformly distributed in the plane of directions; it will not be taken into account
in this analysis.

In addition, since our analysis will initially deal with extragalactic sources, we
remove all photons with a galactic latitude value in [−10°, 10°], presumably belonging
to our galaxy.


Fig. 1 Whole sky map at γ-ray wavelengths accumulated over six years of operations;
the minimum observed value of energy is 1 GeV (http://fermi.gsfc.nasa.gov/)

2 Bayesian Finite Mixture Modelling


2.1 Model Specification 


For a generic photon $i = 1,\ldots,N$, we consider its galactic coordinates $(X_i, Y_i)$, where
$X$ represents the longitude and is defined in the interval [−180°, 180°], while $Y$ is
the galactic latitude and is defined in [−90°, 90°]. We want to study how particles
are scattered in space, and then infer the position of the sources.

Consider a photon generated by a source $j$: its direction is randomly distributed
as

$$(X_i, Y_i) \mid \mu_j \sim \mathrm{PSF}(\mu_j), \qquad (1)$$

where $\mu_j = (\mu_{jx}, \mu_{jy})$ represents the unknown coordinates of the source $j$. Here we
use a 2-dimensional King profile distribution (see [2], Appendix C).

If instead we know that a certain photon was not emitted from a source, we
assume this particle comes from the background contamination; we hence specify a
distribution function also for this component.

For instance, [2] propose to assume a uniform distribution over the entire map. If
we take a look at Figure 2, we can easily see that this assumption does not hold. The
histogram of photon counts related to longitude shows a peak around 0°, while the
latitude histogram shows a descending trend when moving away from the center of the galaxy.
We extend the model specification given in [2] for Fermi LAT data by considering
the bi-dimensional distribution

$$(X_i, Y_i) \mid \sigma_b \overset{ind}{\sim} \mathrm{Unif}(X_{\min}, X_{\max}) \times \mathrm{Lap}(0, \sigma_b). \qquad (2)$$

The longitude of a photon emitted by the background contamination is then modeled
as a uniform over the observed range, without taking into account the peak at
the center of the galaxy, while its latitude is modeled as a Laplace distribution with
mean 0 and scale parameter $\sigma_b$. The two components can be taken as independent
because of the isotropy of the background.


Fig. 2: Left: histogram of longitude values of Fermi LAT photons. Right: histogram of latitude 
values. 
In practice, we have no information about the real number of sources and their
coordinates in space. The origin of each photon is unknown. We further need to take
into account the background contamination, which has a masking effect on signal
sources. We thus translate these assumptions into a statistical model, which uses a
mixture of components, one for each source in the map, and one for the background
contamination, that is

$$(X_i, Y_i) \sim \omega_b\, g_b(X_i, Y_i \mid \theta_b) + \sum_{j=1}^{K} \omega_j f_j(X_i, Y_i \mid \theta_j). \qquad (3)$$

In (3), $g_b(\cdot \mid \cdot)$ represents the distribution of photons from the background, $f_j(\cdot \mid \cdot)$
represents the source $j$ and $\omega = (\omega_b, \ldots, \omega_K)$ is a vector of weights, which can be
viewed as the intensity of each component.

Note that the number of sources $K$ is assumed to be unknown, and will be estimated
along with the other parameters of the model.

In particular, for $f_j$ we consider the density function defined in (1), while $g_b$
assumes the form described in (2). We also have $\theta_b = \{\sigma_b\}$ and $\theta_j = \{\mu_j\}$. The
total set of unknown model parameters is then $\Theta_K = \{\omega, \theta_b, \ldots, \theta_K\}$. The goal is to
make inference on $(\Theta_K, K)$.

From the Bayesian point of view, each unknown parameter is a random variable.
Since the number of sources $K$ is itself unknown and must be estimated, we attach
to it a probability distribution. Here we put

$$K \sim \mathrm{Poi}_t(\kappa, K_{\min}, K_{\max}), \qquad (4)$$

where $\mathrm{Poi}_t$ is a truncated Poisson, defined in the interval $[K_{\min}, K_{\max}]$. The a priori
distributions for $\omega$ and $\mu_j$ are chosen as in [2], while for $\sigma_b$ we choose an Inverse
Gamma distribution, which is the conjugate prior for the scale parameter of the
Laplace distribution.
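
The conjugacy mentioned above can be made concrete with a short sketch. For latitudes y_1, ..., y_n of photons currently allocated to the background, with a Laplace(0, σ) likelihood and an Inverse Gamma(a, b) prior on σ, the posterior is Inverse Gamma(a + n, b + Σ|y_i|). The function names and the hyperparameter values used when calling them are illustrative assumptions, not taken from the paper.

```python
import random


def inverse_gamma_posterior(latitudes, shape, scale):
    """Posterior (shape, scale) of the Laplace scale under an Inverse Gamma prior.

    Likelihood: y_i ~ Laplace(0, sigma); prior: sigma ~ InvGamma(shape, scale).
    """
    n = len(latitudes)
    return shape + n, scale + sum(abs(y) for y in latitudes)


def draw_scale(latitudes, shape, scale, rng):
    """One Gibbs-type draw of sigma from its Inverse Gamma full conditional."""
    post_shape, post_scale = inverse_gamma_posterior(latitudes, shape, scale)
    # If G ~ Gamma(shape=a, scale=1/b), then 1/G ~ InvGamma(a, b).
    return 1.0 / rng.gammavariate(post_shape, 1.0 / post_scale)
```

For example, `draw_scale([-2.0, 1.0, 3.0], 2.0, 1.0, random.Random(1))` draws from an Inverse Gamma(5, 7) distribution.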


2.2 Simulation Algorithm 


The main challenge of our model lies in the fact that the dimension of the parameter 
space is itself unknown. 

Reversible Jump Markov Chain Monte Carlo was introduced for the first time in 
[1] to infer model parameters when the model specification is uncertain. We propose 
a two step simulation algorithm. 

In the first step, we fix the dimension of the parameter space and we simulate
from the posterior distribution of $\Theta_K$ using Gibbs Sampling and the Metropolis-
Hastings algorithm. The updating sequence starts with an allocation of photons into
the identified sources and the background. We then update the posterior distribution
of the weight vector $\omega$, the coordinates of each source $\mu_j$ and the scale parameter
of the background $\sigma_b$.
Algorithm 1 Reversible Jump MCMC - split move

procedure SPLIT j INTO j1, j2 WITH PROBABILITY b_k = 0.25 (from k to k+1 sources)
    u1, u2, u3 ~ Beta(2,2), v ~ Unif(0,1)
    ω_{j1} ← u1 ω_j and ω_{j2} ← (1 − u1) ω_j
    compute μ_{j1}, μ_{j2} from μ_j, ω_{j1}, ω_{j2} (see [2])
    compute quantities p_{k+1}, p_k, g(u1,u2,u3) and J (see [2])
    q_k = b_k / k and q_{k+1} = d_{k+1} / (k+1)
    if argmin_l ||μ_{j1}, μ_l|| = j2 and argmin_l ||μ_{j2}, μ_l|| = j1 then
        q_{k+1} ← 2 q_{k+1}
    A ← p_{k+1} q_{k+1} J / (p_k q_k g(u1,u2,u3))
    if v < min(1, A) then accept split.



In the second step a trans-dimensional jump is proposed, which evaluates whether
to increase or decrease the number of sources by one. The choice is made randomly;
split, combine, birth and death moves are used with equal probability to jump
among different dimensions of the space, according to [4]. The pseudocode for the
split move is given in Algorithm 1.
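
The weight split and the accept/reject decision of Algorithm 1 can be sketched in a few lines. This is a hedged illustration only: the posterior terms p_{k+1} and p_k, the proposal density g and the Jacobian J are defined in [2] and are treated here as inputs supplied on the log scale, and the function names are hypothetical.

```python
import math
import random


def propose_weight_split(omega_j, rng):
    """Split the weight of source j using u1 ~ Beta(2, 2)."""
    u1 = rng.betavariate(2.0, 2.0)
    return u1 * omega_j, (1.0 - u1) * omega_j, u1


def split_acceptance(log_p_new, log_p_old, q_new, q_old, log_g, log_jacobian):
    """Return min(1, A) with A = p_{k+1} q_{k+1} J / (p_k q_k g).

    Working on the log scale avoids numerical underflow of the posterior terms.
    """
    log_a = (log_p_new + math.log(q_new) + log_jacobian
             - log_p_old - math.log(q_old) - log_g)
    return 1.0 if log_a >= 0.0 else math.exp(log_a)


def accept_split(acceptance, rng):
    """Accept the split when v ~ Unif(0, 1) falls below min(1, A)."""
    return rng.random() < acceptance
```

The two new weights always sum to the weight of the original source, so the mixture weights remain normalised after a split.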

Each step of the algorithm adds a source to or removes it from the mixture. 
The background contamination is thus left unchanged and therefore it will never be 
excluded from the model. 

The final estimation of $(\Theta_K, K)$ depends on the number of sources included in
the model; we fix $K$ equal to the mode of its posterior distribution.


3 Application to Fermi LAT data 


We now apply the proposed technique to the Fermi LAT data; in particular, we focus 
on the region of the sky map represented in Figure 1 with longitude values less than 
—10° and latitude values larger than 10°. This choice excludes the contamination 
coming from the center of the galaxy. 

Previous analyses identified 54 sources in this sector¹. We now want to evaluate
the behaviour and the clustering performance of our method when applied to a map
characterized by a high level of background contamination, as is the case for
the Fermi LAT data. The list of discovered sources is used as a benchmark for the
clustering performance of our model.

We start the procedure running 10,000 iterations of the reversible jump MCMC 
algorithm from different starting points to explore the entire map; Figure 3 shows 
the posterior distribution of K produced by the first chain. 

Although it emerges from this graph that the most visited value of K is 147, this
number most likely does not represent the real number of sources in the analyzed
map. This result seems instead to be a side effect of the rather strong background
contamination, which both masks the signal and leads to the detection of false
positives. A statistical method to discriminate between real and fake clusters is thus
necessary.

1 https://fermi.gsfc.nasa.gov/ssc/data/access/lat/2FHL/

                 Known Sources   Unknown Groups
Only Weights           8               14
Only Energies         10               12
Both                   6                3

Fig. 3: Left: posterior distribution of K. Right: results obtained from classification with the three
proposed methods.

We propose to use an empirical analysis of the posterior distributions of the
weights $\omega$ and of the energy spectra of the clusters. From our simulation studies
(results not shown here), it emerged that at times the reversible jump MCMC
overestimates the number of components of the mixture in the presence of background
contamination; however, the posterior distributions of the weights associated with these
false groups are very close to 0. This empirical result leads us to select as sources
all those clusters with a median value of the weights higher than a defined threshold,
which we fix to 0.01.

A similar empirical approach can be applied to the energy spectra of the clusters, 
and selects those groups with high levels of energy. In particular, we selected 125 
GeV as the threshold for classification with energy spectra. 
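
The two selection rules can be sketched as follows. The thresholds 0.01 and 125 GeV come from the text; everything else is an assumption made for illustration. In particular, the text does not say which summary of the energy spectrum is compared with 125 GeV, so the sketch assumes the highest photon energy in the cluster, and the function names and input layout are hypothetical.

```python
from statistics import median

WEIGHT_THRESHOLD = 0.01   # threshold on the median posterior weight (from the text)
ENERGY_THRESHOLD = 125.0  # GeV, threshold on the energy spectrum (from the text)


def passes_weight_rule(weight_draws):
    """Keep a cluster whose median posterior weight exceeds the threshold."""
    return median(weight_draws) > WEIGHT_THRESHOLD


def passes_energy_rule(photon_energies):
    """Assumed rule: keep a cluster whose highest photon energy exceeds the threshold."""
    return max(photon_energies) > ENERGY_THRESHOLD


def select_clusters(clusters):
    """clusters maps a name to (posterior weight draws, photon energies in GeV)."""
    by_weights = {name for name, (w, _) in clusters.items() if passes_weight_rule(w)}
    by_energies = {name for name, (_, e) in clusters.items() if passes_energy_rule(e)}
    return by_weights, by_energies, by_weights & by_energies
```

The third returned set contains the clusters retained by both rules, which is the most conservative of the three selections.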

The table in Figure 3 compares our results after discrimination with the list of
published sources. It emerges that 6 clusters selected after both weight and energy
discrimination coincide with known sources. If instead a single classification method
is applied, both known sources and unknown groups emerge from the results.
Causal analysis of Cell Transformation Assays
Analisi causale dei Cell Transformation Assay 


Federico Mattia Stefanini 


Abstract A Cell Transformation Assay (CTA) is an in vitro method to test a chem- 
ical for carcinogenicity. In a recent contribution from an international expert group 
created to improve the analysis of BALB/c 3T3 CTA data, two classes of models 
in the frequentist paradigm were recommended. Here a Bayesian model for poten- 
tial outcomes is developed to estimate the causal effect of some concentrations of a 
candidate carcinogen on counts of foci growing within Petri dishes. The reanalysis 
of an actual case study is performed to illustrate some limitations of current models 
and the main features of the proposed approach. 

Abstract (in Italian) Il Cell Transformation Assay (saggio di trasformazione cellu- 
lare) è un metodo in vitro per saggiare la carcinogenicità di una sostanza chimica. 
In un recente contributo di un gruppo di esperti internazionali, creato per migliorare 
l’analisi dei dati provenienti dal saggio con la linea cellulare BALB/c 3T3, sono 
stati raccomandati due classi di modelli nel paradigma frequentista. In questo la- 
voro viene sviluppato un modello Bayesiano per i risultati potenziali allo scopo di 
stimare l’effetto causale di alcune concentrazioni di un candidato cancerogeno sul 
conteggio dei foci che crescono entro capsule Petri. La rianalisi di un caso di stu- 
dio reale viene realizzata per illustrare alcune limitazioni dei metodi correnti e le 
principali caratteristiche del modello proposto. 


Key words: Bayesian causal model, potential outcomes 


1 Introduction 


It has been estimated that annual cancer incidence will rise from 14 million in year
2012 to 22 million within the next 2 decades [1]. Chemical carcinogenicity is defined as
the ability of a chemical substance, or a mixture of chemical substances, to induce
cancer or to increase its incidence. Given the role played by environmental and
chemical exposures, it is no wonder that the evaluation of chemical carcinogenicity has
become a leading task in public health risk assessment during the last decades.


Department of Statistics, Computer Science, Applications
University of Florence
e-mail: stefanini@disia.unifi.it


Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations.
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press

Cell Transformation Assays (CTAs) are a family of in vitro methods for the
identification of potential chemical carcinogens. It has been shown that CTA results
correlate well with the rodent bioassay, which is considered the standard approach for
carcinogenicity testing [2]. The endpoint assessed in CTAs is the progression of
cultured cells from immortality to tumorigenicity, as evidenced by the formation of foci
of multilayered and disorganised cells, growing over the surrounding regular cell
monolayer. Therefore, the number of fully transformed cell colonies, called type III
foci, grown within a Petri dish (experimental unit) 4 weeks after treatment with
the chemical under testing is the outcome of primary interest.

In the following, a Bayesian model for potential outcomes is developed to esti- 
mate the causal effect of different concentrations of a chemical in a CTA. 


2 A Bayesian model 


The case study considered here is a CTA experiment performed to test o-toluidine
(CAS chemical registry number 636-21-5). A total of eight different concentrations
and the negative control were considered, and they are (μg/ml): 0 (negative
control, i = 0), 20 (i = 1), 100 (i = 2), 200 (i = 3), 500 (i = 4), 800 (i = 5), 1000
(i = 6), 1200 (i = 7), 1750 (i = 8). A total of 90 Petri dishes containing BALB/c
3T3 cells sampled at the same passage from the original cell culture were treated
after random assignment of each concentration to $n_i = 10$ dishes (replicates) for each
i. All experimental units received protocol ingredients taken from the same batch,
including medium and serum. Four weeks after treatment, Petri dishes were
visually scored under a light microscope and the number of type III foci within each
dish was counted.

Following Rubin's framework for causal inference [3], potential outcomes are
introduced for every treatment and experimental unit under consideration. Let $Y_k^{<i>}$
be random variables representing the potential number of foci within Petri dish
$k = 1,\ldots,n$ under treatment (concentration) $i = 0,1,\ldots,L$. A plausible size for the
sample space $\Omega_{Y^{<i>}}$ is around 30, thus $\Omega_{Y^{<i>}} = \{0,1,\ldots,30\}$, because the available
physical space on a Petri dish is limited.

The potential outcomes referred to concentration $i$ define the vector $Y^{<i>}$. Let
$W = (W_1,\ldots,W_n)^T$ be the vector of indicators of treatment assignment, with sample
space $\Omega_{W_k} = \{0,\ldots,L\}$. Given that CTAs belong to the class of randomized
experiments, the assignment mechanism is ignorable and characterized by unit-level
probability of treatment assignment in the interval (0,1); in particular, the probability
mass function of vector $W$ that represents the assignment mechanism is:

$$p(W) = \binom{n}{n_0\, n_1 \cdots n_L}^{-1} \qquad (1)$$

for all $W$ satisfying $\sum_{k=1}^{n} I_i(W_k) = n_i$ for each $i$. Under row (unit) exchangeability of
matrix $(Y^{<0>},\ldots,Y^{<L>})$ the joint distribution of potential outcomes is:

$$p(Y^{<0>},\ldots,Y^{<L>}) = \int \prod_{k=1}^{n} p(y_k^{<0>},\ldots,y_k^{<L>} \mid \theta)\, p(\theta)\, d\theta \qquad (2)$$

where $\theta$ is a vector of model parameters belonging to the parameter space $\Theta$.

The elicitation of conditional distributions for potential outcomes given model
parameters (eq. 2), the so-called science, should take into account the main processes
driving the emergence of foci. Even if it is not carcinogenic, a chemical may
exert a toxic effect on cultured cells, thus causing a reduction in the final number of
type III foci. If a chemical is carcinogenic then it is expected to stimulate the emergence
of foci, but this driving force also depends on concentration: too low doses
are ineffective, too high doses are often cytotoxic. Although concentrations are
selected to be within a convenient range, it is quite difficult to anticipate any correlation
between potential outcomes. For these reasons, conditional independence
among potential outcomes is here assumed:

$$p(Y^{<0>},\ldots,Y^{<L>},\theta) = \prod_{i=0}^{L} p(Y^{<i>} \mid \theta_i)\, p(\theta_i) \qquad (3)$$

where the joint distribution of model parameters is factorized into marginally
independent subvectors, $\theta = (\theta_0,\ldots,\theta_i,\ldots,\theta_L)$.
At the end of the experiment, the vector of observed potential outcomes is:

$$y^{<obs>} = \left( \sum_{i=0}^{L} I_i(W_k)\, y_k^{<i>} \right)^{T}_{k=1,\ldots,n}$$

while the collection of vectors $C^{<mis>} = \{y^{<mis:0>},\ldots,y^{<mis:L>}\}$ with

$$y^{<mis:i>} = \{ y_k^{<i>} : W_k \neq i,\ k = 1,\ldots,n \}$$

and $i = 0,\ldots,L$, contains missing potential outcomes. The conditional predictive
distribution $p(C^{<mis>} \mid y^{<obs>}, W)$ is exploited to impute missing values.

Three causal estimands of particular interest are finite sample averages of
indicators, with $i \geq 1$: the probability of positive effect on the treated (PPET) units,

$$\tau_{PPET} = \sum_{k=1}^{n} I_{\{y^{<i>} - y^{<0>} > 0 \,\wedge\, w = i\}}(Y_k^{<i>}, Y_k^{<0>}, W_k) \Big/ \sum_{k=1}^{n} I_i(W_k),$$

the probability of null effect on the treated (PNET) units, where the indicator of the
event is $I_{\{y^{<i>} - y^{<0>} = 0 \,\wedge\, w = i\}}(Y_k^{<i>}, Y_k^{<0>}, W_k)$, and the probability of cytotoxic effect
on the treated (PCET) units, where the indicator is $I_{\{y^{<i>} - y^{<0>} < 0 \,\wedge\, w = i\}}(Y_k^{<i>}, Y_k^{<0>}, W_k)$.
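
Once the missing potential outcomes under the control have been imputed, the three estimands reduce to simple proportions over the dishes treated at concentration i. The sketch below shows this for a single imputed data set; the function name and the toy inputs in the usage note are assumptions for illustration, and in a Bayesian analysis the computation would be repeated for each draw from the predictive distribution.

```python
def causal_estimands(y_level, y_control, assignment, level):
    """Return (PPET, PNET, PCET) for the units assigned to `level`.

    y_level[k]   : potential outcome of dish k under concentration `level`
    y_control[k] : potential outcome of dish k under the control (observed or imputed)
    assignment[k]: concentration index actually assigned to dish k
    """
    treated = [k for k, w in enumerate(assignment) if w == level]
    n_treated = len(treated)
    diffs = [y_level[k] - y_control[k] for k in treated]
    ppet = sum(d > 0 for d in diffs) / n_treated   # positive effect
    pnet = sum(d == 0 for d in diffs) / n_treated  # null effect
    pcet = sum(d < 0 for d in diffs) / n_treated   # cytotoxic effect
    return ppet, pnet, pcet
```

For instance, `causal_estimands([3, 0, 1, 5], [1, 0, 2, 9], [2, 2, 2, 0], 2)` uses only the first three dishes; the three proportions always sum to one.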

Further structure may be imposed on the model after borrowing context and as- 
sumptions settled by an international expert group (European Centre for the Valida- 
tion of Alternative Methods, ECVAM), in particular: 


1. the number of Petri dishes at each concentration is typically 10 (never below 9); 

2. the number of levels for the concentration typically ranges from 3 to 7; 

3. focus-inducing chemicals are expected to show non-monotone dose-concentration 
relationships, mostly due to cytotoxicity at higher concentrations; 

4. positive controls are not informative; 

5. at small concentrations the empirical distribution of counts may be degenerate, 
typically at zero; 

6. concentrations have to be considered as levels of a qualitative factor, although
originally on a quantitative scale (μg/ml).


ECVAM's experts recommended two tentative classes of models: the first is a
Normal model for Nishiyama-transformed counts, $x = \sqrt{y} + \sqrt{1+y}$, and the second
is a Negative Binomial model for original counts. Unfortunately, the two
recommended classes did not seem suited to our case study (Section 3).
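The transformation can be sketched in a few lines. The form used here, the sum of the square roots of y and 1 + y, is read from the damaged formula in the text and is consistent with the later remark that null counts are mapped to 1; the function name `nishiyama` is a hypothetical label.

```python
import math

def nishiyama(y):
    """Transformed count x = sqrt(y) + sqrt(1 + y) for a non-negative count y."""
    return math.sqrt(y) + math.sqrt(1 + y)

# A null count is mapped to 1; larger counts grow roughly like 2 * sqrt(y).
transformed = [nishiyama(y) for y in (0, 1, 4)]
```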

ECVAM's committee proposed two families of distributions allowing asymmetry
(original scale) and smooth changes of probability value in contiguous (transformed)
counts. By introducing latent variables $X_i \sim \mathrm{Beta}(x \mid \alpha_i, \beta_i)$ in the Beta
family, we essentially maintained the original belief while gaining in flexibility: values
of variance smaller than the mean became possible. The probability of observing
count $y_{i,j}$ is thus $P[Y_{i,j} = y] = \int \cdots \mathrm{Beta}(x \mid \alpha_i, \beta_i)\,dx$. A weakly informative
initial distribution was elicited for marginally independent model parameters, with
$\alpha_i \sim \mathrm{Uniform}(1, 1000)$ and $\beta_i \sim \mathrm{Exponential}(0.01)$, $i \in \{0, \ldots, L\}$.


3 Results and discussion 


Computations were performed in R¹ using RStudio² and the following packages³:
MASS, fitdistrplus, rjags, coda, knitr.

In the first step of the analysis, Normal models for Nishiyama-transformed counts
were considered. Note that after transformation, null counts are mapped to 1. From
unbiased point estimates of model parameters at each concentration $i$, plug-in estimates
of the probability values $P[X_i < 1 \mid \hat{\mu}_i, \hat{\sigma}_i^2]$ were calculated; for concentrations
from 0 to 200 they were well above 0.15, namely 0.1904, 0.2265, 0.2673, 0.2734.


1 https://www.R-project.org/
2 http://www.rstudio.com 
3 https://cran.r-project.org/web/packages/ 
For concentrations equal to 500 and 800, the estimates were 0.1470 and 0.0550 respectively.
Only for the last 3 concentrations were the point estimates below 0.02. Fitting
distributions by maximum likelihood always reduced the above estimated
probability values by about 0.01. Quantile-quantile plots (not shown), even if based
on just 10 observations, detected clear departures of Nishiyama-transformed counts
from normality for all concentrations smaller than 800.

Note that no concentration showed counts all equal to zero (or to one), an event
of appreciable probability in many case studies; therefore it was possible to obtain
the unbiased point estimate of the variance. For this reason, we did not consider
the recommended artificial increase of sample size by one observation equal to 1, an
action that would have increased the sample size by about 10% at each
concentration. Furthermore, in case all observations are equal to one, such an artificial
change of observed counts is not even uniquely defined. Given the role played by
the predictive distribution in the Bayesian causal model, and therefore by the model
for observations, the Normal model for Nishiyama-transformed counts seemed
unsatisfactory and therefore was not considered further.

In the second step, we considered the class of Negative Binomial models. The
optimization of likelihood functions at each concentration often failed due to
divergence of the scale parameter towards infinity. Even upon termination, it was
sometimes impossible to calculate standard errors, or in other cases estimated values were
huge. Similar failures were observed using other algorithms, for example iterated
moment matching. Indeed, a small sample size at each concentration ($n_i = 10\ \forall i$)
and the presence of sample variances often smaller than sample averages made the
optimization hard, given that the Negative Binomial family is not suited to
underdispersed count data. All things considered, the class of Negative Binomial models
seemed unsatisfactory for the case study at hand and therefore it was not considered
further.
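The difficulty with underdispersed counts can be illustrated with simple moment matching. The sketch below assumes the usual mean/size parameterisation of the Negative Binomial, with variance equal to mean + mean²/size, and takes the size parameter to be the one that diverges in the text. The function name and the two samples are hypothetical.

```python
import statistics

def nb_size_by_moments(counts):
    """Moment-matching estimate of the Negative Binomial size parameter.

    With variance = mean + mean**2 / size, matching the first two moments
    gives size = mean**2 / (variance - mean). When the sample variance does
    not exceed the sample mean there is no finite positive solution, and
    None is returned.
    """
    m = statistics.mean(counts)
    v = statistics.variance(counts)  # unbiased sample variance
    if v <= m:
        return None
    return m * m / (v - m)

# Hypothetical samples of 10 dishes each (not data from the paper).
underdispersed = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]   # variance below the mean
overdispersed = [0, 0, 1, 0, 7, 0, 2, 9, 0, 1]    # variance above the mean

size_under = nb_size_by_moments(underdispersed)
size_over = nb_size_by_moments(overdispersed)
```

As the sample variance approaches the sample mean from above, the matched size grows without bound, which mirrors the divergence reported for the likelihood optimisation.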

The proposed Beta latent model was fitted by MCMC: a sample of 1 x 10° realizations
from the final distribution of model parameters was collected after thinning
one chain by 4. The initial burn-in consisted of 10,000 iterations. Values of the
difference between pairs of counterfactuals in the numerator of $\tau_{PPET}^{<i>}$, $i = 1, \ldots, 8$, were
saved and further processed to obtain their distribution conditional on observed
outcomes at each concentration. One-chain output diagnostics did not suggest lack of
convergence. In Table 1, estimated probabilities of carcinogenic/null/cytotoxic
effects are shown, based on a sample of 1 x 10° imputed counterfactuals for each
concentration. The odds of a carcinogenic effect of o-toluene at concentrations 1000
and 1200 are about ten. The 5% quantile of the distribution of odds for a carcinogenic
effect of o-toluene is equal to 2.3333 at both concentrations 1000 and 1200,
thus both are well above 1.
Table 1 Estimated probability values of causal effects.


Concentration <i> 20 100 200 500 800 1000 1200 1750 

Wii 0.403 0.454 0.441 0.324 0.194 0.032 0.024 0.107 

tizi 0.386 0.417 0.391 0.365 0.254 0.067 0.053 0.147 

$\tau_{PPET}^{<i>}$ 0.211 0.129 0.168 0.311 0.552 0.901 0.923 0.746

odds = $\tau_{PPET}^{<i>} / (1 - \tau_{PPET}^{<i>})$ 0.266 0.148 0.202 0.451 1.232 9.101 11.987 2.937
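The last row of Table 1 appears to be the odds p / (1 - p) of the probability reported in the row above it. This reading is inferred from the numbers rather than stated in the text, and the short check below reproduces the reported odds for the four highest concentrations from the rounded probabilities. The helper name `odds` is hypothetical.

```python
def odds(p):
    """Odds in favour of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)

# Probabilities of a positive (carcinogenic) effect taken from Table 1.
ppet = {800: 0.552, 1000: 0.901, 1200: 0.923, 1750: 0.746}
table_odds = {conc: round(odds(p), 3) for conc, p in ppet.items()}
```

For the lowest concentrations the reported odds differ from this calculation in the third decimal place, presumably because the table entries were computed from unrounded probabilities.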


4 Conclusions 


We developed a Bayesian causal model based on latent Beta distributions to over- 
come limitations found in alternative proposals when applied to the o-toluene case 
study. Instead of exploiting beliefs by adding virtual observations, prior information 
was formally introduced. Causal estimands like PPET were formulated to restrain 
the assessment by excluding the magnitude of differences, but motivations of this 
choice will be detailed elsewhere. Further similar estimands might address effects 
due to the increase of concentration, an important issue for applications not reported 
in this work. 

The limited sample size and the small number of concentrations characterizing a 
typical CTA make the assessment of model assumptions hard. It is clear that alter- 
native, and more general, latent models might be formulated, for example one that 
allows for rare but very extreme counts. These outliers were absent in our case study 
but they are not so rarely observed in CTAs. Replicates of the same experiment per- 
formed on a chemical in different laboratories represent an opportunity to increase 
sample size, at least after properly considering transportability. 

Finally, a battery of positive controls made up of known genotoxic and nongenotoxic
carcinogens, for example 3-methylcholanthrene, could be introduced to study
the variability in the response of experimental units to treatments.
Estimation and Inference of Skew-Stable
distributions using the Multivariate Method of 
Simulated Quantiles 


Stima e inferenza per i parametri delle distribuzioni 
Stabili asimmetriche utilizzando il metodo dei quantili 
simulati 


Stolfi Paola and Bernardi Mauro and Petrella Lea 


Abstract The multivariate method of simulated quantiles (MMSQ) is proposed as
a likelihood-free alternative to indirect inference procedures that does not rely on
an auxiliary model specification, and its asymptotic properties are established. As a
further improvement we introduce the smoothly clipped absolute deviation (SCAD)
$\ell_1$-penalty into the MMSQ objective function in order to achieve sparse estimation
of the scaling matrix. We extend the asymptotic theory and we show that the
sparse-MMSQ estimator enjoys the oracle properties under mild regularity conditions. The
method is applied to estimate the parameters of the Skew Elliptical Stable
distribution.

Abstract In questo lavoro viene proposto il metodo dei quantili simulati multivariati
che rappresenta una valida alternativa alle procedure di inferenza indiretta e che non
richiede la specificazione di un modello ausiliario e vengono dimostrate le proprietà
asintotiche dello stimatore. Allo scopo di indurre una stima sparsa della matrice di
scala introduciamo inoltre la funzione di penalità SCAD all'interno della funzione
obiettivo del metodo. Un ulteriore contributo è rappresentato dall'estensione della
teoria asintotica nel caso sparso e dalla dimostrazione che lo stimatore soddisfa
le proprietà ORACLE. Il metodo è applicato alla stima dei parametri della
distribuzione Stabile asimmetrica.


Key words: Directional quantiles; Sparse regularisation; Skew Elliptical Stable 
distribution. 
1 Introduction


In this paper we extend the method of simulated quantiles (MSQ) of Dominicy and
Veredas (2013) to a multivariate framework (MMSQ). The method of simulated
quantiles, like alternative likelihood-free procedures, is based on the minimisation of
the distance between appropriate quantile-based statistics evaluated on the true and
simulated data. The MMSQ effectively deals with distributions that do not admit
moments of any order, like the $\alpha$-Stable or the Tukey lambda, without relying on
the choice of a misspecified auxiliary model. The lack of a natural ordering in the
multivariate setting requires a careful definition of the concept of quantile. Here,
we rely on the notion of projectional quantile recently introduced by Hallin et al.
(2010) and Kong and Mizera (2012). This notion of multivariate quantile makes the
estimator flexible and allows us to deal with non-elliptically contoured
distributions. As a further improvement we introduce the smoothly clipped absolute
deviation (SCAD) $\ell_1$-penalty of Fan and Li (2001) into the MMSQ objective function in
order to achieve sparse estimation of the scaling matrix. The method is illustrated
using several synthetic datasets from distributions for which alternative procedures
are recognised to perform poorly, such as the Skew Elliptical Stable distribution
(SESD) first introduced by Branco and Dey (2001).

The remainder of the paper is structured as follows. Section 2 introduces the
sparse MMSQ estimator. Section 3 defines the Skew-Elliptical distribution of
Branco and Dey (2001), while Section 4 presents simulated-data experiments to
assess the effectiveness of the proposed method. Section 5 concludes.


2 The Multivariate Method of Simulated Quantiles 


Let:


(i) $Y \in \mathbb{R}^m$ be a random variable with distribution function $F_Y(\cdot, \vartheta)$, which depends
on a vector of unknown parameters $\vartheta \in \Theta \subset \mathbb{R}^k$, and $y = (y_1, y_2, \ldots, y_n)'$ be a
vector of $n$ independent realisations of $Y$;

(ii) $q_\vartheta^{\tau,u} = (q_\vartheta^{\tau_1 u}, q_\vartheta^{\tau_2 u}, \ldots, q_\vartheta^{\tau_s u})'$ be an $s \times 1$ vector of projectional quantiles at given
confidence levels $\tau_i \in (0,1)$ with $i = 1, 2, \ldots, s$, and $u \in S^{m-1}$;

(iii) $\Phi_{u,\vartheta} = \Phi(q_\vartheta^{\tau,u})$ be a $b \times 1$ vector of quantile functions assumed to be
continuously differentiable with respect to $\vartheta$ for all $Y$ and measurable for $Y$ and for all
$\vartheta \in \Theta$;

(iv) $\hat{q}^{\tau,u} = (\hat{q}^{\tau_1 u}, \hat{q}^{\tau_2 u}, \ldots, \hat{q}^{\tau_s u})'$ and $\hat{\Phi}_u = \Phi(\hat{q}^{\tau,u})$ be the corresponding sample
counterparts;

and assume that $\Phi_{u,\vartheta}$ cannot be computed analytically but can be empirically
calculated on simulated data. At each iteration $j = 1, 2, \ldots$ the MMSQ computes

$$\tilde{\Phi}^R_{u,\vartheta_j} = \frac{1}{R}\sum_{r=1}^{R} \tilde{\Phi}^r_{u,\vartheta_j},$$

where $\tilde{\Phi}^r_{u,\vartheta_j}$ is the function $\Phi_{u,\vartheta}$ computed at the $r$th simulation
path from $F_Y(\cdot, \vartheta_j)$. The parameters are subsequently updated by minimising
the distance between the vector of quantile measures calculated on the true
observations, $\hat{\Phi}_u$, and that calculated on simulated realisations, $\tilde{\Phi}^R_{u,\vartheta_j}$. The subscript
$u$ denotes that those quantities depend on a set of directions that should be properly
chosen. We establish consistency and asymptotic normality of the proposed estimator.
The MMSQ estimator is then extended in order to achieve sparse estimation
of a scaling matrix $\Sigma$. Specifically, the SCAD $\ell_1$-penalty of Fan and Li (2001) is
introduced into the MMSQ objective function as follows

$$\hat{\vartheta} = \arg\min_{\vartheta} \left(\hat{\Phi}_u - \tilde{\Phi}^R_{u,\vartheta}\right)' W_\vartheta \left(\hat{\Phi}_u - \tilde{\Phi}^R_{u,\vartheta}\right) + n \sum_{i<j} p_\lambda(|\sigma_{ij}|), \qquad (1)$$

where $W_\vartheta$ is a $b \times b$ symmetric positive definite weighting matrix, $\Sigma = (\sigma_{ij})_{i,j=1}^{m}$
is the scale matrix and $p_\lambda(\cdot)$ is the SCAD $\ell_1$-penalty. By setting the tuning
parameter $\lambda = 0$, equation (1) reduces to the non-sparse MMSQ estimator. We extend the
asymptotic theory and we show that the sparse-MMSQ estimator enjoys the oracle
properties under mild regularity conditions.
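The unpenalised objective can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the authors' implementation: the model is a hypothetical bivariate location-scale family with independent Cauchy components, the quantile measures are the median and the interquartile range of the projections along the canonical directions, the weighting matrix is the identity, and the simulated paths reuse fixed random draws. All function names are hypothetical, and the minimisation step is left to any numerical optimiser.

```python
import numpy as np

def quantile_measures(y, directions):
    """Median and interquartile range of the projections u'y, one per direction."""
    proj = y @ directions.T                      # shape (n, number of directions)
    q25, q50, q75 = np.quantile(proj, [0.25, 0.5, 0.75], axis=0)
    return np.concatenate([q50, q75 - q25])

def simulate(theta, z):
    """Toy model Y = mu + s * Z, where Z holds fixed heavy-tailed draws."""
    mu, s = theta[:-1], theta[-1]
    return mu + s * z

def mmsq_objective(theta, phi_hat, z_paths, directions, W=None):
    """Quadratic distance between sample and averaged simulated measures."""
    phi_tilde = np.mean(
        [quantile_measures(simulate(theta, z), directions) for z in z_paths],
        axis=0,
    )
    d = phi_hat - phi_tilde
    W = np.eye(d.size) if W is None else W       # identity weights by default
    return float(d @ W @ d)

rng = np.random.default_rng(0)
directions = np.eye(2)                           # canonical directions in S^1
theta_true = np.array([1.0, -1.0, 2.0])          # (mu_1, mu_2, s), hypothetical
y = simulate(theta_true, rng.standard_cauchy((2000, 2)))
z_paths = rng.standard_cauchy((5, 2000, 2))      # R = 5 simulated paths
phi_hat = quantile_measures(y, directions)

obj_true = mmsq_objective(theta_true, phi_hat, z_paths, directions)
obj_far = mmsq_objective(np.array([0.0, 0.0, 1.0]), phi_hat, z_paths, directions)
```

Because only quantiles of projections enter the objective, the criterion remains well defined even though the Cauchy components have no moments.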


3 Skew Elliptical Stable distribution 


In this Section we define the quantile-based measures and the optimal directions
$u \in S^{m-1}$ for the parameters of the SESD distribution $Y \sim \mathrm{SESD}_m(\alpha, \xi, \Omega, \delta)$
introduced by Branco and Dey (2001). For the shape parameter $\alpha$, the locations $\xi_i$,
the skewness parameters $\delta_i$ and scale parameters $\omega_{ii}$, $i = 1, 2, \ldots, m$, we consider

$$\kappa_u = \frac{q_{0.95,u} - q_{0.05,u}}{q_{0.75,u} - q_{0.25,u}}, \qquad m_u = q_{0.5,u},$$

$$\gamma_u = \frac{q_{0.95,u} + q_{0.05,u} - 2\,q_{0.5,u}}{q_{0.95,u} - q_{0.05,u}}, \qquad \varsigma_u = q_{0.75,u} - q_{0.25,u},$$

where $u \in S^{m-1}$ defines a relevant direction. Once the quantile-based measures have
been selected, we need to identify the optimal directions for each parameter. Let
us consider the locations first. Because of the presence of skewness, the median
computed along the canonical directions is not a good quantile measure for the
locations. Therefore, we consider a transformation of the data in order to remove
the skewness. The properties of the Skew Elliptical Stable distribution imply that,
for $Y^- \sim \mathrm{SESD}_m(\alpha, \xi, \Omega, -\delta)$ independent of $Y$, it holds

$$Z = \frac{Y + Y^-}{\sqrt{2}} \sim \mathrm{SESD}_m\left(\alpha, \sqrt{2}\,\xi, \Omega, 0\right), \qquad (2)$$
which means that the variable $Z$ is symmetric and, up to a constant, it has the same
location parameter as $Y$. Therefore, we choose, as informative measure for the
locations, the median of the transformed variable $Z$ in equation (2). In order to
estimate the remaining parameters, we consider univariate marginals that have Skew
Elliptical Stable distribution, i.e., $Y_i \sim \mathrm{SESD}_1(\alpha, \xi_i, \omega_{ii}, \delta_i)$, by construction. The
quantile-based measures for the shape, skewness and for the diagonal elements of
the scale matrix $\omega_{ii}$ are then computed along the canonical directions.

Now we need to identify the optimal directions for the off-diagonal elements
of the scale matrix $\Omega$. To this end, we consider the bivariate marginal variables
$Y_{ij} = (Y_i, Y_j)'$ for $1 \leq i < j \leq m$. It holds $Y_{ij} \sim \mathrm{SESD}_2(\alpha, \xi_{ij}, \Omega_{ij}, \delta_{ij})$, where
$\xi_{ij} = (\xi_i, \xi_j)'$ and

$$\Omega_{ij} = \begin{pmatrix} \omega_{ii} & \omega_{ij} \\ \omega_{ij} & \omega_{jj} \end{pmatrix},$$

while $\delta_{ij} = (\delta_i, \delta_j)'$. Moreover, let $Y_{ij}^- \sim \mathrm{SESD}_2(\alpha, \xi_{ij}, \Omega_{ij}, -\delta_{ij})$ be
independent of $Y_{ij}$ and let us consider the same construction introduced for the
locations, that is the random variable $Z_{ij} = (Y_{ij} + Y_{ij}^-)/\sqrt{2}$, having
distribution $Z_{ij} \sim \mathrm{SESD}_2(\alpha, \sqrt{2}\,\xi_{ij}, \Omega_{ij}, 0)$. Since $Z_{ij}$ is a symmetric variable we
choose the optimal direction $u^* \in S^1$ such that

$$u^* = \arg\max_{u \in S^1} u' \Omega_{ij} u. \qquad (3)$$
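Maximising a quadratic form over the unit circle is a standard eigenvalue problem: the maximiser is the eigenvector associated with the largest eigenvalue of the symmetric matrix. The sketch below, with the hypothetical helper name `optimal_direction`, applies this to the 2 x 2 block formed by the last two variables of the five-dimensional scale matrix used in the simulation study. In practice the block is unknown, so this only illustrates the geometry of equation (3).

```python
import numpy as np

def optimal_direction(omega_ij):
    """Unit vector maximising u' Omega_ij u for a symmetric 2 x 2 block."""
    eigvals, eigvecs = np.linalg.eigh(omega_ij)   # eigenvalues in ascending order
    u = eigvecs[:, -1]                            # eigenvector of the largest one
    # Fix the sign so that the largest component is positive (u and -u are
    # equivalent maximisers).
    return u if u[np.argmax(np.abs(u))] > 0 else -u

omega_45 = np.array([[2.0, 2.55],
                     [2.55, 4.0]])                # block of the 5 x 5 scale matrix
u_star = optimal_direction(omega_45)
value_at_u_star = float(u_star @ omega_45 @ u_star)
```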


4 Simulated-data experiment 


To illustrate the effectiveness of the MMSQ in estimating the parameters
of the SESD we consider a simulation example where we fix the dimension $m = 5$
and $\alpha = 1.70$, while the location, skewness and scale parameters are $\xi = 0$,
$\delta = (0, 0, 0, 0.9, 0.9)$ and

$$\Sigma = \begin{pmatrix} 0.25 & 0.25 & 0.4 & 0 & 0 \\ 0.25 & 0.5 & 0.4 & 0 & 0 \\ 0.4 & 0.4 & 1 & 0 & 0 \\ 0 & 0 & 0 & 2 & 2.55 \\ 0 & 0 & 0 & 2.55 & 4 \end{pmatrix}, \qquad (4)$$

respectively. We consider two different sample sizes, $n = 500$ and $n = 2000$, and we fix the
number of simulated paths $R = 5$. Table 1 reports the bias (BIAS), the standard
deviation (SSD) and the empirical coverage probabilities (ECP) obtained over 100
replications of the simulation experiment. Our results show that the MMSQ estimator is always
unbiased, indeed the BIAS is always less than 0.15. The SSDs are always small, in
particular for $n = 500$ it is always less than 0.5. Moreover, the empirical
coverages are always in line with their expected values. In order to apply the
sparse-MMSQ we consider a simulation example of dimension $m = 12$, with $n = 200$
and $R = 5$, where the location parameters are equal to zero, the skewness parameters are
$\delta = (0, 0, 0.6, 0, 0, 0, 0, 0, 0, 0.6, 0.6, 0)'$, while we consider the same scale matrix as
                        n = 500                        n = 2000
Par.     True       BIAS      SSD      ECP        BIAS      SSD      ECP

α        1.70     -0.0068   0.0690   0.9500     0.0013   0.0320   0.9500
δ_1      0.00      0.0048   0.0067   0.9400     0.0022   0.0010   0.1100
δ_2      0.00      0.0048   0.0063   0.9500     0.0082   0.0016   0.0100
δ_3      0.00      0.0040   0.0038   0.9100     0.0013   0.0005   0.0600
δ_4      0.90     -0.0116   0.1648   0.9700     0.0180   0.0185   0.8100
δ_5      0.90     -0.0179   0.1649   0.9700     0.0167   0.0234   0.9200
ξ_1      0.00      0.0016   0.0365   0.9600     0.0032   0.0218   0.9400
ξ_2      0.00     -0.0029   0.0534   0.9700     0.0023   0.0286   0.9400
ξ_3      0.00      0.0093   0.0757   0.9400     0.0065   0.0393   0.9500
ξ_4      0.00     -0.0051   0.0703   0.9700     0.0041   0.0356   0.9500
ξ_5      0.00     -0.0059   0.1089   0.9200    -0.0040   0.0618   0.9400

ω_11     0.2500   -0.0126   0.0259   0.9400    -0.0027   0.0140   0.9800
ω_22     0.5000    0.0184   0.0596   0.9200     0.0003   0.0261   0.9400
ω_33     1.0000    0.0038   0.0998   0.9700     0.0166   0.0538   0.9500
ω_44     2.0000   -0.1397   0.3571   0.9300    -0.1571   0.1700   0.8800
ω_55     4.0000   -0.4342   0.6637   0.9100    -0.1142   0.3980   0.9600
ω_12     0.7071   -0.0438   0.1336   0.9400    -0.0345   0.1055   0.9100
ω_13     0.8000   -0.1043   0.1487   0.9200    -0.0173   0.1050   0.9800
ω_14     0.00      0.0075   0.0256   0.9300     0.0018   0.0148   0.9400
ω_15     0.00      0.0085   0.0445   0.9700     0.0040   0.0170   0.9400
ω_23     0.5657   -0.0851   0.1680   0.9300    -0.0323   0.1255   0.9700
ω_24     0.00      0.0049   0.0306   0.9600     0.0032   0.0154   0.9200
ω_25     0.00      0.0076   0.0414   0.9300     0.0053   0.0172   0.9600
ω_34     0.00      0.0047   0.0277   0.9100     0.0022   0.0151   0.9300
ω_35     0.00      0.0100   0.0332   0.9500     0.0032   0.0151   0.9300
ω_45     0.9016   -0.0727   0.0785   0.8200    -0.0552   0.0573   0.9300


Table 1 Bias (BIAS), sample standard deviation (SSD), and empirical coverage probability (ECP)
at the 95% confidence level for the locations ξ = (ξ_1, ξ_2, ..., ξ_d), scale matrix Ω = {ω_ij} with
i, j = 1, 2, ..., d and i ≤ j, tail parameter α = 1.70 and skewness parameter δ_i, i = 1, 2, ..., d of the
Skew Elliptical Stable distribution in dimension 5. The results reported above are obtained using
100 replications.


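in Wang (2015) and reported below

$$\Sigma = \left(\begin{array}{rrrrrrrrrrrr}
0.239 & 0.117 & 0 & 0 & 0 & 0 & 0 & 0.031 & 0 & 0 & 0 & 0 \\
0.117 & 1.554 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.362 & 0.002 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.002 & 0.199 & 0.094 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.094 & 0.349 & 0 & 0 & 0 & 0 & 0 & 0 & -0.036 \\
0 & 0 & 0 & 0 & 0 & 0.295 & -0.229 & 0.002 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -0.229 & 0.715 & 0 & 0 & 0 & 0 & 0 \\
0.031 & 0 & 0 & 0 & 0 & 0.002 & 0 & 0.164 & 0.112 & -0.028 & -0.008 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.112 & 0.518 & -0.193 & -0.09 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -0.028 & -0.193 & 0.379 & 0.167 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -0.008 & -0.09 & 0.167 & 0.159 & 0 \\
0 & 0 & 0 & 0 & -0.036 & 0 & 0 & 0 & 0 & 0 & 0 & 0.207
\end{array}\right). \qquad (5)$$

For the simulated example in dimension $m = 12$, Figure 1 displays the band
structure of the true and estimated scale matrices, which are very close.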
Fig. 1 Images displaying the band structure of the true (left) and estimated (right) scale matrices
of the simulated example in dimension m = 12.


5 Conclusion 


In this paper the problem of parameter estimation and inference of Skew-Stable
distributions has been approached using the multivariate method of simulated
quantiles. Moreover, since the curse of dimensionality prevents any effective
inferential procedure as the number of dimensions increases, we introduce the
sparse-MMSQ estimator and we prove that the estimator enjoys the oracle
properties under mild regularity conditions. The MMSQ and the sparse-MMSQ have been
applied to the problem of estimating the parameters of the multivariate Skew-Stable
distribution introduced by Branco and Dey (2001). Our simulation results show that
the proposed methodology effectively achieves sparse estimation of the scale
parameter.
Sparse Indirect Inference
Inferenza indiretta sparsa 


Stolfi Paola and Bernardi Mauro and Petrella Lea 


Abstract In this paper we propose a sparse indirect inference estimator. In order to
achieve sparse estimation of the parameters, the Smoothly Clipped Absolute
Deviation (SCAD) $\ell_1$-penalty of Fan and Li (2001) is added into the indirect inference
objective function introduced by Gouriéroux et al. (1993). We derive the asymptotic
theory and we show that the sparse Indirect Inference estimator enjoys the oracle
properties under mild regularity conditions. The method is applied to estimate the
parameters of large dimensional non-Gaussian regression models.

Abstract In questo lavoro si propone un metodo di stima indiretta sparsa. A tal fine
la funzione di penalità SCAD-$\ell_1$ di Fan and Li (2001) è introdotta nella funzione
obiettivo del metodo di inferenza indiretta di Gouriéroux et al. (1993). Sotto usuali
condizioni di regolarità vengono inoltre dimostrate la consistenza e la Normalità
asintotica unitamente alle proprietà di stimatore ORACLE. Il metodo è illustrato
con l'applicazione alla stima di modelli di regressione lineare con distribuzione
non-Gaussiana del termine di errore.


Key words: Indirect inference; sparse regularisation; SCAD penalty; stable
non-Gaussian models.


1 Introduction 


Indirect inference (II) methods (Gouriéroux et al., 1993; Gallant and Tauchen, 1996)
are likelihood-free alternatives to maximum likelihood or moment-based estimation
methods for parametric inference when a closed-form expression for the density
is not available. Throughout the paper we consider the following dynamic model

$$y_t = r(y_{t-1}, x_t, u_t, \vartheta) \qquad (1)$$

$$u_t = \varphi(u_{t-1}, \varepsilon_t, \vartheta), \qquad (2)$$

where $x_t$ are exogenous variables whereas $u_t$ and $\varepsilon_t$ are latent variables. We assume
that: (i) $x_t$ is a homogeneous Markov process with transition distribution $F_0$
independent of $\varepsilon_t$ and $u_t$; (ii) the process $\varepsilon_t$ is a white noise whose distribution $G_0$ is
known, and (iii) the process $\{y_t, x_t\}$ is weakly stationary. We further assume that
the joint density function of the observations $\{y_t, x_t\}_{t=1}^{T}$ is not known analytically.
The II method replaces the maximum likelihood estimator of the parameter $\vartheta$ in
equations (1)-(2) with a quasi-maximum likelihood estimator which relies on an
alternative auxiliary model and then exploits simulations from the original model to
correct for inconsistency. Specifically, let $Q_T(y_T, x_T, \beta)$ be the auxiliary criterion
function, which depends on the observations $\{y_t, x_t\}_{t=1}^{T}$ and on the auxiliary parameter
$\beta \in B \subset \mathbb{R}^q$, such that $\lim_{T\to\infty} Q_T(y_T, x_T, \beta) = Q_\infty(F_0, G_0, \vartheta_0, \beta)$, a.s., where $\vartheta_0$ is
the true parameter of interest; then

$$\hat{\beta}_T = \arg\max_{\beta \in B} Q_T(y_T, x_T, \beta). \qquad (3)$$

Under the additional assumptions that the limit criterion is continuous in $\beta$ and
has a unique maximum $\beta_0$, the estimator $\hat{\beta}_T$ is a consistent estimator of $\beta_0$,
which is unknown since it depends on $F_0$ and $\vartheta_0$, which are unknown. To overcome this
problem, the II method simulates, for each value of $\vartheta$, $H$ paths $\tilde{y}_T^h(\vartheta)$ for $h = 1, 2, \ldots, H$,
computes the QML estimate $\tilde{\beta}_T^h(\vartheta)$ for the auxiliary model in equation (3) and
subsequently minimises the following objective function

$$\hat{\vartheta}_T = \arg\min_{\vartheta} \left(\hat{\beta}_T - \frac{1}{H}\sum_{h=1}^{H}\tilde{\beta}_T^h(\vartheta)\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \frac{1}{H}\sum_{h=1}^{H}\tilde{\beta}_T^h(\vartheta)\right), \qquad (4)$$

for an appropriately chosen positive-definite square symmetric matrix $\hat{\Omega}_T$. Indirect
estimators are consistent and asymptotically Normal under mild regularity
conditions, see Gouriéroux et al. (1993). The most important condition concerns the
binding function, which maps the parameter space of the true model into the
parameter space of the auxiliary model:

$$b(F, G, \vartheta) = \arg\max_{\beta \in B} Q_\infty(F, G, \vartheta, \beta), \qquad (5)$$

must be one-to-one. We further assume that $\frac{\partial b}{\partial \vartheta'}(F_0, G_0, \vartheta_0)$ is of full-column rank.
In the following Section we introduce the Sparse-II estimator.
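The mechanics of indirect inference can be sketched on a textbook toy problem that is not taken from the paper: the structural model is an MA(1) process, the auxiliary model is an AR(1) regression fitted by least squares, the weighting matrix is the scalar 1, and the minimisation in equation (4) is carried out on a grid. All names and numerical settings below are hypothetical choices for illustration.

```python
import numpy as np

def simulate_ma1(theta, eps):
    """MA(1) path y_t = eps_t + theta * eps_{t-1} built from fixed innovations."""
    return eps[1:] + theta * eps[:-1]

def aux_estimate(y):
    """Auxiliary AR(1) coefficient by least squares (plays the role of beta)."""
    return float(y[1:] @ y[:-1] / (y[:-1] @ y[:-1]))

def indirect_inference(y, eps_paths, grid):
    """Grid minimiser of the squared distance between auxiliary estimates."""
    beta_hat = aux_estimate(y)

    def objective(theta):
        beta_sim = np.mean(
            [aux_estimate(simulate_ma1(theta, e)) for e in eps_paths]
        )
        return (beta_hat - beta_sim) ** 2        # scalar case, weight equal to 1

    values = [objective(t) for t in grid]
    return float(grid[int(np.argmin(values))])

rng = np.random.default_rng(1)
T, H, theta_true = 4000, 5, 0.5
y = simulate_ma1(theta_true, rng.standard_normal(T + 1))
eps_paths = rng.standard_normal((H, T + 1))      # reused for every theta
grid = np.linspace(0.0, 0.95, 96)
theta_hat = indirect_inference(y, eps_paths, grid)
```

In this toy case the binding function is theta / (1 + theta^2), which is strictly increasing on the grid, so the one-to-one condition holds. Reusing the same innovations for every candidate value keeps the simulated objective smooth in theta.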
2 Sparse indirect inference


In order to achieve sparse estimation of the parameter $\vartheta$ we introduce the Smoothly
Clipped Absolute Deviation (SCAD) $\ell_1$-penalty of Fan and Li (2001) into the II
objective function. The SCAD function is a non-convex penalty function with the
following form

$$p_\lambda(|\gamma|) = \begin{cases} \lambda|\gamma| & \text{if } |\gamma| \leq \lambda, \\ \dfrac{1}{a-1}\left(a\lambda|\gamma| - \dfrac{\gamma^2}{2}\right) - \dfrac{\lambda^2}{2(a-1)} & \text{if } \lambda < |\gamma| \leq a\lambda, \\ \dfrac{\lambda^2(a+1)}{2} & \text{if } a\lambda < |\gamma|, \end{cases} \qquad (6)$$

which corresponds to a quadratic spline function with knots at $\lambda$ and $a\lambda$. The SCAD
penalty is continuously differentiable on $(-\infty, 0) \cup (0, \infty)$ but singular at 0, with its
derivative equal to zero outside the range $[-a\lambda, a\lambda]$. This results in small coefficients being
set to zero, a few other coefficients being shrunk towards zero, and the
large coefficients being retained as they are. The Sparse II estimator minimises the penalised II
objective function, as follows

$$\hat{\vartheta} = \arg\min_{\vartheta} D^*(\vartheta), \qquad (7)$$

where

$$D^*(\vartheta) = \left(\hat{\beta}_T - \frac{1}{H}\sum_{h=1}^{H}\tilde{\beta}_T^h(\vartheta)\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \frac{1}{H}\sum_{h=1}^{H}\tilde{\beta}_T^h(\vartheta)\right) + n\sum_{i} p_\lambda(|\vartheta_i|), \qquad (8)$$

where $\hat{\Omega}_T$ is a positive-definite square symmetric matrix. A similar approach in a
different context has been recently proposed by Blasques and Duplinskiy (2015).
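The piecewise definition of the SCAD penalty translates directly into code. The sketch below follows the three branches of equation (6), uses the conventional default a = 3.7, and evaluates the penalty at the two knots to illustrate its continuity. The function name `scad_penalty` is a hypothetical label.

```python
import numpy as np

def scad_penalty(gamma, lam, a=3.7):
    """SCAD penalty evaluated elementwise at |gamma| (requires a > 2, lam > 0)."""
    g = np.abs(np.asarray(gamma, dtype=float))
    linear = lam * g                                            # |gamma| <= lam
    quadratic = (a * lam * g - 0.5 * g**2) / (a - 1.0) - lam**2 / (2.0 * (a - 1.0))
    constant = 0.5 * lam**2 * (a + 1.0)                         # |gamma| > a * lam
    return np.where(g <= lam, linear, np.where(g <= a * lam, quadratic, constant))

lam, a = 0.5, 3.7
values = scad_penalty([0.0, lam, a * lam, 10.0], lam, a)
```

At the first knot the penalty equals lam squared and beyond the second knot it stays flat at lam^2 (a + 1) / 2, which is why large coefficients are left unshrunk.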


3 Asymptotic theory 


As shown in Fan and Li (2001), the SCAD estimator, with an appropriate choice of
the regularisation (tuning) parameter, possesses a sparsity property, i.e., it estimates
zero components of the true parameter vector exactly as zero with probability
approaching one as the sample size increases, while still being consistent for the non-zero
components. An immediate consequence of the sparsity property of the SCAD
estimator is the so-called oracle property, i.e., the asymptotic distribution of the
estimator remains the same whether or not the correct zero restrictions are imposed in the
course of the SCAD estimation procedure. More specifically, let $\vartheta_0 = (\vartheta_0^1, \vartheta_0^0)'$ be
the true value of the unknown parameter $\vartheta$, where $\vartheta_0^1 \in \mathbb{R}^s$ is the subset of non-zero
parameters and $\vartheta_0^0 = 0 \in \mathbb{R}^{k-s}$, and let $A = \{i : \vartheta_{0,i} \in \vartheta_0^1\}$; we consider the following
definition of oracle estimator given by Zou (2006).

Definition 1. An oracle estimator $\hat{\vartheta}_{oracle}$ has the following properties:


(i) consistent variable selection: $\lim_{n\to\infty} P(\hat{A}_n = A) = 1$, where $\hat{A}_n = \{i : \hat{\vartheta}_i \in \hat{\vartheta}^1_{oracle}\}$;


(ii) asymptotic normality: $\sqrt{n}\left(\hat{\vartheta}^1_{oracle} - \vartheta_0^1\right) \xrightarrow{d} N(0, \Sigma)$, as $n \to \infty$, where $\Sigma$ is the
variance covariance matrix of $\vartheta_0^1$.

In the remainder of the Section we establish the oracle properties of the penalised
SCAD II estimator. To this end, the following set of assumptions is needed:


(i) 
dQ 


sv 


Or:xr, Bo) — ae co (er, A). (9) 


is asymptotically normal with mean zero, and asymptotic variance—covariance 
matrix given by W = limr_,.. V (r); 
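(ii)

$$\lim_{T\to\infty} V\left(\sqrt{T}\,\frac{\partial Q_T}{\partial \beta}\left(\tilde{y}_T^h, x_T, \beta_0\right)\right) = I_0, \qquad (10)$$

and the limit is independent of the initial values $z_0^h$ for $h = 1, 2, \ldots, H$;

(iii)

$$\lim_{T\to\infty} \mathrm{cov}\left(\sqrt{T}\,\frac{\partial Q_T}{\partial \beta}\left(\tilde{y}_T^h, x_T, \beta_0\right), \sqrt{T}\,\frac{\partial Q_T}{\partial \beta}\left(\tilde{y}_T^l, x_T, \beta_0\right)\right) = K_0, \qquad (11)$$

and the limit is independent of $z_0^h$ and $z_0^l$ for $h \neq l$;

(iv)

$$\operatorname*{plim}_{T\to\infty} -\frac{\partial^2 Q_T}{\partial \beta\, \partial \beta'}\left(\tilde{y}_T^h, x_T, \beta_0\right) = -\frac{\partial^2 Q_\infty}{\partial \beta\, \partial \beta'}\left(F_0, G_0, \vartheta_0, \beta_0\right), \qquad (12)$$

and the limit is independent of $z_0^h$.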

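The next Theorem states that the estimator defined in equation (7) satisfies the
sparsity property.

Theorem 1. Given the SCAD penalty function $p_\lambda(\cdot)$, for a sequence of $\lambda_n$ such that
$\lambda_n \to 0$ and $\sqrt{n}\lambda_n \to \infty$, as $n \to \infty$, there exists a local minimiser $\hat{\vartheta}$ of $D^*(\vartheta)$ in (7)
with $\|\hat{\vartheta} - \vartheta_0\| = O_p(n^{-1/2})$. Furthermore, we have

$$\lim_{n\to\infty} P\left(\hat{\vartheta}^0 = 0\right) = 1. \qquad (13)$$

The following theorem establishes the asymptotic normality of the penalised SCAD
II estimator; we denote by $\vartheta^1$ the subvector of $\vartheta$ that does not contain zero elements
and by $\hat{\vartheta}^1$ the corresponding penalised II estimator.

Theorem 2. Given the SCAD penalty function $p_\lambda(|\vartheta_i|)$, for a sequence $\lambda_n \to 0$ and
$\sqrt{n}\lambda_n \to \infty$ as $n \to \infty$, then $\hat{\vartheta}^1$ has the following asymptotic distribution:

$$\sqrt{n}\left(\hat{\vartheta}^1 - \vartheta_0^1\right) \xrightarrow{d} N\left(0, \left(1 + \frac{1}{H}\right) W\right), \qquad (14)$$

as $n \to \infty$, where

$$W = \left(b'(F_0, G_0, \vartheta_0^1)'\, \Omega\, b'(F_0, G_0, \vartheta_0^1)\right)^{-1} W_1 \left(b'(F_0, G_0, \vartheta_0^1)'\, \Omega\, b'(F_0, G_0, \vartheta_0^1)\right)^{-1}$$

and

$$W_1 = b'(F_0, G_0, \vartheta_0^1)'\, \Omega\, J_0^{-1} (I_0 - K_0) J_0^{-1}\, \Omega\, b'(F_0, G_0, \vartheta_0^1), \qquad (15)$$

where $b'(F_0, G_0, \vartheta_0^1) = \partial b(F_0, G_0, \vartheta_0^1) / \partial \vartheta^{1\prime}$ is the first derivative of the binding function
$b(F_0, G_0, \vartheta_0^1)$.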

4 Sparse II algorithm 


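The objective function of the sparse estimator is the sum of a convex function and a
non-convex function, which complicates the minimisation procedure. Here, we adapt
the algorithms proposed by Fan and Li (2001) and Hunter and Li (2005) to our
objective function in order to allow a fast procedure for the minimisation problem.
To this aim we consider the first order Taylor expansion of the penalty, for $\vartheta_i \approx \vartheta_{i0}$,

$$p_\lambda(|\vartheta_i|) \approx p_\lambda(|\vartheta_{i0}|) + \frac{1}{2}\,\frac{p'_\lambda(|\vartheta_{i0}|)}{|\vartheta_{i0}|}\left(\vartheta_i^2 - \vartheta_{i0}^2\right), \qquad (16)$$

where the first derivative of the penalty function has been approximated as follows:

$$\left[p_\lambda(|\vartheta_i|)\right]' = p'_\lambda(|\vartheta_i|)\,\mathrm{sgn}(\vartheta_i) \approx \frac{p'_\lambda(|\vartheta_{i0}|)}{|\vartheta_{i0}|}\,\vartheta_i, \qquad (17)$$

when $\vartheta_i \neq 0$. The objective function $D^*$ in equation (7) can be locally approximated,
except for a constant term, by

$$D^*(\vartheta) \approx \left(\hat{\beta}_T - \bar{\beta}_{\vartheta_0}\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \bar{\beta}_{\vartheta_0}\right) - 2\left(\vartheta - \vartheta_0\right)' \left(\partial\bar{\beta}_{\vartheta_0}\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \bar{\beta}_{\vartheta_0}\right)$$
$$\qquad + \left(\vartheta - \vartheta_0\right)' \left(\partial\bar{\beta}_{\vartheta_0}\right)' \hat{\Omega}_T \left(\partial\bar{\beta}_{\vartheta_0}\right) \left(\vartheta - \vartheta_0\right) + \frac{n}{2}\,\vartheta' P_\lambda(\vartheta_0)\,\vartheta, \qquad (18)$$

where $\bar{\beta}_{\vartheta_0} = \frac{1}{H}\sum_{h=1}^{H}\tilde{\beta}_T^h(\vartheta_0)$, $\partial\bar{\beta}_{\vartheta_0} = \frac{1}{H}\sum_{h=1}^{H} \partial\tilde{\beta}_T^h(\vartheta_0)/\partial\vartheta'$ and
$P_\lambda(\vartheta_0) = \mathrm{diag}\left\{p'_\lambda(|\vartheta_{i0}|)/|\vartheta_{i0}|;\ \vartheta_{i0} \neq 0\right\}$. Then the first order
condition becomes

$$\frac{\partial D^*(\vartheta)}{\partial \vartheta} = -2\left(\partial\bar{\beta}_{\vartheta_0}\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \bar{\beta}_{\vartheta_0}\right) + 2\left(\partial\bar{\beta}_{\vartheta_0}\right)' \hat{\Omega}_T \left(\partial\bar{\beta}_{\vartheta_0}\right)\left(\vartheta - \vartheta_0\right)$$
$$\qquad + n P_\lambda(\vartheta_0)\left(\vartheta - \vartheta_0\right) + n P_\lambda(\vartheta_0)\,\vartheta_0 = 0, \qquad (19)$$

therefore

$$\vartheta = \vartheta_0 + \left[2\left(\partial\bar{\beta}_{\vartheta_0}\right)' \hat{\Omega}_T \left(\partial\bar{\beta}_{\vartheta_0}\right) + n P_\lambda(\vartheta_0)\right]^{-1} \left[2\left(\partial\bar{\beta}_{\vartheta_0}\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \bar{\beta}_{\vartheta_0}\right) - n P_\lambda(\vartheta_0)\,\vartheta_0\right]. \qquad (20)$$

The optimal solution can be found iteratively, as follows

$$\vartheta^{(k+1)} = \vartheta^{(k)} + \left[2\left(\partial\bar{\beta}_{\vartheta^{(k)}}\right)' \hat{\Omega}_T \left(\partial\bar{\beta}_{\vartheta^{(k)}}\right) + n P_\lambda(\vartheta^{(k)})\right]^{-1} \left[2\left(\partial\bar{\beta}_{\vartheta^{(k)}}\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \bar{\beta}_{\vartheta^{(k)}}\right) - n P_\lambda(\vartheta^{(k)})\,\vartheta^{(k)}\right], \qquad (21)$$

and if $\vartheta_i^{(k)} \approx 0$, then $\vartheta_i^{(k+1)}$ is set equal to zero. When the algorithm converges the
estimator satisfies the following equation

$$-2\left(\partial\bar{\beta}_{\hat{\vartheta}}\right)' \hat{\Omega}_T \left(\hat{\beta}_T - \bar{\beta}_{\hat{\vartheta}}\right) + n P_\lambda(\hat{\vartheta})\,\hat{\vartheta} = 0, \qquad (22)$$

that is, the first order condition of the minimisation problem of the Sparse-II
estimator.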
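The iteration can be illustrated on a simplified problem in which the simulated auxiliary estimate is exactly linear in the parameter, so that the quadratic term reduces to a least squares criterion with identity weights. This substitution is an assumption made only to keep the sketch self-contained; it is not the authors' implementation, and the names `scad_derivative` and `lqa_scad` are hypothetical.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |theta|, elementwise."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def lqa_scad(A, b, lam, n, a=3.7, tol=1e-8, max_iter=200):
    """Local quadratic approximation for min (b - A t)'(b - A t) + n * sum SCAD(|t_i|).

    Each step solves the linearised first order condition
    (2 A'A + n P) t = 2 A'b on the active set; coordinates that fall
    below `tol` are set to zero and stay there.
    """
    theta = np.linalg.lstsq(A, b, rcond=None)[0]      # unpenalised start
    for _ in range(max_iter):
        active = np.abs(theta) > tol
        theta_new = np.zeros_like(theta)
        if active.any():
            Aa = A[:, active]
            P = np.diag(scad_derivative(theta[active], lam, a) / np.abs(theta[active]))
            theta_new[active] = np.linalg.solve(2.0 * Aa.T @ Aa + n * P, 2.0 * Aa.T @ b)
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    theta[np.abs(theta) < tol] = 0.0
    return theta

# Toy problem: identity design, two large and two small coefficients.
A = np.eye(4)
b = np.array([3.0, 0.01, -2.5, 0.02])
theta_sparse = lqa_scad(A, b, lam=0.5, n=1)
```

In the toy problem the two large coefficients lie beyond the second knot, where the penalty is flat, so they are left unchanged, while the two small ones shrink geometrically and are set to zero.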
5 Tuning parameter selection


The SCAD penalty requires the selection of two tuning parameters $(a, \lambda)$. The first
tuning parameter is fixed at $a = 3.7$ as suggested in Fan and Li (2001), while the
parameter $\lambda$ is selected using the cross validation function


K 1 i 1 H >y x k 1 H si 
cv(4)= X — (a- =D A, 2 (A SR Bi) » 83 


where $\hat{\vartheta}_{\lambda,k}$ denotes the parameter estimate over the sample $\left(\bigcup_{k=1}^{K} T_k\right) \setminus T_k$ with $\lambda$ as
tuning parameter. Then the optimal value is chosen as $\lambda^* = \arg\min_\lambda CV(\lambda)$, where
again the minimisation is performed over a grid of values for $\lambda$.


6 Application 


Let $y = (y_1, y_2, \ldots, y_T)'$ be the vector of observations on the scalar response variable
$Y$, let $X = (x_1', x_2', \ldots, x_T')'$ be the $(n \times p)$ matrix of observations on the $p$ covariates,
i.e., $x_j = (x_{j,1}, x_{j,2}, \ldots, x_{j,p})$, and consider the following regression model

$$y = 1_T \delta + X\gamma + \varepsilon, \qquad \varepsilon \sim S_\alpha(0, \sigma), \qquad (24)$$

where $1_T$ is the $T \times 1$ vector of unit elements, $\delta \in \mathbb{R}$ denotes the parameter related
to the intercept of the model, $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_p)'$ is the $p \times 1$ vector of regression
parameters and $S_\alpha(0, \sigma)$ denotes the symmetric $\alpha$-Stable distribution
(Samorodnitsky et al. 1994) centred at zero with characteristic exponent $\alpha \in (0, 2)$ and scale
parameter $\sigma > 0$. We further assume that the elements of the vector of innovations
$\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T)$ are independent, i.e. $\varepsilon_j \perp \varepsilon_k$ for any $j \neq k$, and that they are
independent of $x_l$, for $l = 1, 2, \ldots, p$. Indirect inference for Stable distributions has been
previously considered by Lombardi and Veredas (2009). The Sparse-II method
requires the definition of the auxiliary model as well as the metric used to compare the
synthetic data $\tilde{y}$ generated by the method with the true data $y$. As auxiliary
distribution we consider the Student-t regression model defined in equation (24), with the
only difference that the error term follows a Student-t distribution $\varepsilon \sim T(0, \sigma, \nu)$.
As regards the metric, we consider the $L_2$ distance between the scores of the
auxiliary model evaluated at the true $y$ and simulated $\tilde{y}$, i.e., $\|\nabla(\vartheta, y) - \nabla(\vartheta, \tilde{y})\|_2^2$. In
Table 1, we report the empirical inclusion probabilities of the regression parameters
obtained over 1,00 replications of the $\alpha$-Stable regression model defined in
equation (24), for two values of $\alpha = (1.70, 1.95)$ with $n = 250$. The true parameters are
reported in the column (True) of Table 1, while the scale parameter of the Stable
distribution is held fixed at $\sigma = 0.05$. Our simulation results confirm that the sparse
Indirect estimator performs well in detecting zeros in linear non-Gaussian regression
models.
Par.   True  EIP      EIP       Par.   True  EIP      EIP
             α=1.70   α=1.95                 α=1.70   α=1.95

δ      1     0        0         γ_11   0     0.6591   0.8919
γ_1    2     0        0         γ_12   0     0.7500   0.8378
γ_2    2     0        0         γ_13   0     0.8182   0.9459
γ_3    3     0        0         γ_14   0     0.7273   0.9189
γ_4    1     0        0         γ_15   0     0.7955   0.9730
γ_5    2     0        0         γ_16   0     0.7273   0.8378
γ_6    3     0        0         γ_17   0     0.7727   0.8919
γ_7    1     0        0         γ_18   0     0.8182   0.9189
γ_8    2     0        0         γ_19   0     0.8636   0.9459
γ_9    3     0        0         γ_20   0     0.8636   1.0000
γ_10   0     0        0         γ_21   0     0.8636   0.9459


Table 1 Empirical inclusion probabilities (EIP) evaluated over 1,00 replications for the regression
parameters $(\delta, \gamma)$ of the α-Stable regression model defined in equation (24).


7 Conclusion 


In this paper we introduce the sparse indirect inference (SII) estimator and we ex- 
tend the asymptotic theory. Empirical properties of the estimator are evaluated by 
means of a simulation study where a moderately large linear regression model with 
non-Gaussian innovations is considered. Our results confirm that the SII estimator 
performs well in detecting non zero regressor parameters. 
The ESSnet Big Data: Experimental Results
Gli ESSnet Big Data: risultati sperimentali 


Peter Struijs¹, Anke Consten, Piet Daas, Marc Debusschere, Maiki Ilves, Boro Nikic,
Anna Nowicka, David Salgado, Monica Scannapieco, and Nigel Swier 


Abstract: In the ESSnet Big Data, 22 partners from 20 countries of the European 
Statistical System, the ESS, collaborate in exploring the possibilities of using big 
data as a source for official statistics. The research focuses on seven areas: (1) web 
scraping for statistics on job vacancies, (2) web scraping for obtaining enterprise 
characteristics, (3) use of smart meter data, (4) use of AIS data (maritime data), (5) 
use of mobile phone data, (6) use of big data for early estimates, and (7) combining 
data sources for statistics on the domains of population, tourism and border crossing, 
and agriculture. The paper presents the main results of the ESSnet after its first year, 
highlighting the opportunities and challenges of using big data for official statistics. 


Key words: Big Data, European Statistical System, ESSnet Big Data, Official 
Statistics 


1 Big Data and the European Statistical System 


The emergence of big data is having a big impact on organisations for which the 
production and analysis of data and information is core business. National Statistical 
Institutes (NSIs), which are responsible for official statistics, are such organisations. 
Official statistics are heavily used by policy makers and society as a whole, and the 
way NSIs take up big data will eventually have implications for all of society. 

The relevance of big data for official statistics stems from the exponential 
increase of data registered through networks of sensors, cameras, public


1 Contact: Peter Struijs, Statistics Netherlands, p.struijs@cbs.nl. The co-authors are all work package
leaders of the ESSnet. 
administrations, banks, enterprises, mobile networks, satellites, drones, social
networks, internet sites, etc. This not only creates many opportunities for improving 
official statistics, such as reporting on phenomena whose measurement used to be 
out of reach, but also profoundly influences the context in which statistics are 
produced, for better or worse. And there are many issues with big data that may have 
an impact on NSIs, such as on the required statistical methodology, the way data is 
obtained, privacy and ethical considerations, the need for an appropriate IT 
infrastructure, the skills needed to deal with big data, the quality of statistics based 
on big data, and the positioning of NSIs in the emerging data society. Official 
statistics are generally based on established, validated methods, but for big data new 
approaches are clearly needed [5]. 

The possible strategic impact of big data for official statistics was recognised by 
several NSIs and international organisations some years ago. In September 2013 the 
Directors-General of the NSIs of the European Statistical System (ESS) adopted the
so-called Scheveningen Memorandum on Big Data and Official Statistics [1], in 
which a course of action was set out, including the drafting of an ESS action plan 
and a roadmap. The action plan, which has been worked out in the context of the 
ESS Vision 2020 [2], identifies nine themes: policy; quality; skills; experience 
sharing; legislation; IT infrastructure; methods; ethics/communication; and 
partnerships. It also recognised the need for carrying out concrete pilots. For the 
pilots a so-called ESSnet was created, the results of which are the subject of this 
paper. 

But what is, in fact, big data in the context of official statistics? This question 
was also considered at the international level. In official statistics, big data is 
generally considered as a data source. An attempt to define big data for statistical 
purposes was made by UNECE, the UN Economic Commission for Europe. 
Building on a definition by Gartner [4] it defined big data as follows [3]: “Big data 
are data sources that can be — generally — described as: high volume, velocity and 
variety of data that demand cost-effective, innovative forms of processing for 
enhanced insight and decision making.” 


2 The ESSnet Big Data 


An ESSnet is a project in which a number of ESS partners collaborate in order to 
pursue an ESS goal, with partial EU-funding. Usually, a so-called Framework 
Partnership Agreement (FPA) is established, after which one or more so-called 
Specific Grant Agreements (SGAs) are concluded. In the case of the ESSnet Big 
Data (henceforth: the ESSnet), the FPA is linked to a consortium of 22 partners, 
consisting of 20 National Statistical Institutes (NSIs) and two Statistical Authorities, 
and covers the period from January 2016 to May 2018. There are two SGAs, with an 
overlap in time: SGA-1 extends from February 2016 to July 2017, and SGA-2 from 
January 2017 to May 2018. For each of them, the grant has a value of 1 million euro, 
and funding is limited to 90%. 
The overall objective of the project is to prepare the ESS for integration of big
data sources into the production of official statistics. The consortium has organised 
the core of its work around a number of work packages (WPs), each WP dealing 
with one pilot and a concrete output. In SGA-1 there are seven of them: WP 1 Web 
Scraping / Job Vacancies; WP 2 Web Scraping / Enterprise Characteristics; WP 3 
Smart Meters; WP 4 AIS Data; WP 5 Mobile Phone Data; WP 6 Early Estimates; 
WP 7 Multi Domains.


The outputs of these pilots so far are described in the remainder of this paper. 
They have one thing in common: they cover the complete statistical process, from 
data acquisition to the production of statistical output. In addition, and in accordance 
with the general objective of the FPA, the pilots also consider future perspectives. 
Thus, all pilots comprise the following five phases: 


1. Data access
2. Data handling
3. Methodology and technology
4. Statistical output
5. Future perspectives


SGA-1 covers only some of the five phases for each of the WPs, the rest being 
covered by SGA-2. And the phases covered by SGA-1 are not the same for each 
pilot (WP), as for some areas it was possible to plan ahead further (in time and 
phases) than for other areas. In particular, WP 5 concentrated on data access issues 
in SGA-1 and could not plan further ahead, as data processing would depend on the 
results of the efforts to realise data access. Therefore, WP 5 was planned to end in 
December 2016, whereas the other WPs would continue into 2017. This explains the 
overlap in time of SGA-1 and SGA-2. 

Given the overall objective, the findings need to be generalised. This is done in 
SGA-2, for which a new WP is added, WP 8. This WP covers overarching aspects, 
in particular methodology, quality and IT infrastructure. 

The ESSnet has organised support in several ways. In addition to the IT 
infrastructure of NSIs, common IT facilities were ensured by subscribing to the 
UNECE Sandbox, a facility maintained by the Central Statistics Office (CSO) of 
Ireland and the Irish Centre for High-End Computing (ICHEC)². In order to ensure
that professional standards are met, a Review Board was created that systematically
reviews the products of the ESSnet. Concerning communication and dissemination,
the ESSnet uses a Mediawiki website³. CROS Portal, the general ESS dissemination
site, also links to the products of the ESSnet⁴. The products described in the sections
below can be found there. After the first year of the ESSnet, a dissemination
workshop⁵ was held in Sofia, Bulgaria.


2 http://www1.unece.org/stat/platform/display/bigdata/Sandbox 

3 https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/ 

4 https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en 

5 https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Project_meetings 
3 Web Scraping: Job Vacancies


Since this pilot involves each country taking its own specific approach, there are
many intermediate country-specific results. However, general selection criteria have
been identified for targeting portals for scraping. Taking into account the distinction
between job vacancy and job advertisement, a conceptual model is proposed of how
online job advertisements correspond to the target population (see Figure 1). In
practical terms the target population may be defined as all vacancies that are available to be measured
by existing job vacancy surveys. As well as providing a conceptual framework for 
understanding the coverage of job vacancies from online sources and how these 
relate to the measurement of all job vacancies, this approach may also provide the 
conceptual basis for an estimation framework. 


[Figure 1 diagram omitted; recoverable labels: “Target Population: All job vacancies” and “‘Ghost’ Vacancies”.]

Figure 1: Conceptual model for measuring job vacancies from on-line sources


4 Web Scraping: Enterprise Characteristics 


Six use cases have been identified in the pilot: (1) enterprise URLs inventory, (2) e- 
commerce in enterprises (about predicting whether or not an enterprise provides web 
sales facilities on its website), (3) job vacancies ads on enterprises’ websites, (4) 
social media presence on enterprises’ webpages, (5) sustainability reporting on
enterprises’ websites (linked to the UN Sustainable Development Goals), and (6)
relevant categories of enterprises’ activity sector (NACE) aimed at checking or 
completing statistical business registers. A common use case template was developed 
and has been used. For each use case, a pilot definition was performed and all of 
them were mapped to a general “logical architecture” (see Figure 2). Also, a report 
was produced on legal aspects related to web scraping of enterprise websites, aimed 
at showing the real possibilities for the NSIs to perform activities of web scraping. 
These appear to be generally favourable. 
Figure 2: Logical architecture for software for web scraping of enterprise websites


5 Smart Meters 


This pilot has produced a report on data access and data handling of smart meter 
electricity data. The report includes the results of a literature study and a survey on 
access to smart meter data, which was sent to the NSIs of all EU member countries 
in the spring of 2016, with 18 responses. It appeared that only two countries 
currently have access to data: Denmark and Estonia. Several countries were aware of 
substantial legal barriers, and it was unclear if market participants could even share 
data with each other. Some countries such as Poland are in the process of drawing up 
legislation that will enable smart meter data use. For two countries, Estonia and 
Denmark, the pilot has defined and assessed the quality of smart meter electricity 
data, and a synthetic dataset was analysed as well, aimed at generating demo output 
and developing and testing statistics and algorithms for situations where linkage to 
enterprise or household characteristics is necessary. The assessment of quality 
indicators comprised: under- and overcoverage; percentage of units that fail checks; 
percentage of units that are adjusted; percentage of units that are imputed; data 
periodicity; delay in data provision.
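As an illustration of how some of the listed indicators could be computed, the sketch below derives coverage and processing percentages from a toy data set. The record layout, the field names and the choice of denominators are assumptions made for this example; the pilot's own definitions may differ.

```python
def quality_indicators(register_ids, records):
    """Compute simple quality indicators as percentages.

    register_ids: ids of the units in the target population (hypothetical frame).
    records: one dict per unit observed in the meter data, with keys
             'id', 'failed_check', 'adjusted' and 'imputed' (booleans).
    """
    frame = set(register_ids)
    observed = {r["id"] for r in records}
    n = len(records)

    def pct(count, total):
        return 100.0 * count / total

    return {
        # Units of the frame that are missing from the data.
        "undercoverage_pct": pct(len(frame - observed), len(frame)),
        # Units in the data that do not belong to the frame.
        "overcoverage_pct": pct(len(observed - frame), len(observed)),
        "failed_checks_pct": pct(sum(r["failed_check"] for r in records), n),
        "adjusted_pct": pct(sum(r["adjusted"] for r in records), n),
        "imputed_pct": pct(sum(r["imputed"] for r in records), n),
    }


# Toy example: unit "d" is missing from the data, unit "e" is not in the frame.
records = [
    {"id": "a", "failed_check": False, "adjusted": False, "imputed": False},
    {"id": "b", "failed_check": True, "adjusted": True, "imputed": False},
    {"id": "c", "failed_check": False, "adjusted": False, "imputed": True},
    {"id": "e", "failed_check": False, "adjusted": False, "imputed": False},
]
indicators = quality_indicators(["a", "b", "c", "d"], records)
```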


6 AIS Data 


This pilot investigates whether real-time measurement data of ship positions 
(measured by the so-called AIS system) can be used to improve the quality and 
internal comparability of existing statistics and for new statistical products relevant
for the ESS. Reports were produced on (1) the possibilities and pitfalls of creating a 
database with AIS data for official statistics, (2) deriving harbour visits and linking 
data from maritime statistics with AIS data, and (3) sea traffic analyses using AIS 
data. While the possibility of using AIS data from EMSA, the European Maritime 
Safety Agency, is still being investigated, AIS data from Dirkzwager, a private
company, was used, and the data quality analysed. Visualisations were made, 
showing the coverage of the ships by the data, and showing the path of a ship 
through time. A method to build a reference frame of maritime ships was developed 
and software options considered (see Figure 3). 


[Figure 3 diagram omitted. Recoverable labels: raw data; pre-processed data; UNECE Sandbox; Decoding/Conversion (alternatives: Java (AISlib), Python (AIS)); Data processing and reduction (alternatives: Hadoop (Pig/Hive/RHadoop)); Data Analysis (alternatives: SAS, SPSS).]

Figure 3: Pre-processing, processing and storing the AIS data


First results show that AIS data can be used as a backbone for maritime statistics. 
This is important, since the added value of running a pilot with AIS data at European 
level is linked to the fact that the source data is generic worldwide and data can be 
obtained at European level. 


7 Mobile Phone Data 


This pilot has focussed on data access, which will be needed for SGA-2. A 
preliminary analysis of the issues regarding the access to mobile phone data was 
made, which was the basis for the design of a questionnaire surveying the status of 
this access across the ESS. Belgium, Finland, France, and Italy were found to have 
succeeded in their negotiations to have access to a concrete mobile phone data set 
that can be used for SGA-2. Spain and Romania are still in talks to achieve this
goal. A workshop was held in Luxembourg to bring together mobile network 
operators (MNOs), NSIs, Eurostat (the statistical office of the EU) and other 
stakeholders, including some other international organizations (UN, OECD, ITU, 
DG Connect, DG Digit). Five main groups of issues were identified
regarding the access to mobile phone data, namely (i) the characteristics of the
MNO, (ii) the legal requirements, (iii) the access conditions, (iv) the data
characteristics and (v) other aspects.


8 Early Estimates 


The aim of this pilot is to investigate how a combination of multiple big data sources
and existing official statistical data can be used to produce early estimates of
existing or new statistics. Two pilots were chosen. The first used web-based sales
inquiries to nowcast turnover indices. In this context, Statistics
Finland has examined the nowcasting performance of large dimensional models for
the year-on-year growth rate of the turnover indices, considering the main sectors of
the Finnish economy. The Slovenian NSI has prepared and tested nowcasting
methods on its data, and has also prepared an application in the R environment
(together with a manual), which will enable other countries to test the nowcasting
model on their own data. The second pilot used social media data, newsfeeds and
survey data to produce a Consumer Confidence Index (CCI). Furthermore, for
eight statistical domains the potential of combining multiple big data sources and
existing official statistical data was investigated for the same purpose.
It was concluded that the most promising statistics for
which early estimates could be produced are those related to early economic
estimates. A proposal along these lines was prepared for SGA-2.


9 Multi Domains 


The aim of this pilot is to find out how a combination of big data sources, 
administrative data and statistical data may enrich current statistical output. Three 
statistical domains are being investigated: (1) population, (2) tourism/border 
crossings, and (3) agriculture. For population, three areas are looked at: daily (life) 
satisfaction, the mood of the population associated with public events (e.g., Brexit,
voting), and morbidity areas (e.g., flu). For tourism/border crossings, a number of 
possible data sources have been identified and investigated, for instance with regard 
to traffic intensity information. For agriculture, the focus is on satellite data 
applications, in particular monitoring of crop conditions, seasonal changes, soil 
properties and mapping tillage activities. For each (sub)domain a big data framework 
is developed (see Figure 4). 
Figure 4: Big data framework for daily (life) satisfaction


10 Outlook 


In the remaining time of the ESSnet, the pilots will deal with the phases not covered 
so far, which implies less focus on data access and more focus on statistical outputs. 
In particular, attention will be paid to the meaning of the results for the future of ESS 
statistics. The pilot for mobile phone data is the only one that did not involve actual 
data handling in SGA-1, as data access issues had to be solved first. Mobile phone data is generally
considered to be one of the most promising new data sources for statistics.

Apart from the results of the individual pilots, the ESSnet is also going to draw 
conclusions about methodology, quality and IT infrastructure for big data in general, 
building on the results of the seven pilots as well as on experiences and results 
described in the literature. Fascinating questions are to what extent the more 
traditional, established corpus of statistical methods can be deployed or needs to be 
supplemented with new approaches when dealing with big data, and to what extent 
the findings on methodology, quality and IT infrastructure are source dependent or 
can be generalised. 
Smart view selection in multi-view clustering
Selezione intelligente di viste per clustering multi-vista 


Jérémie Sublime 


Abstract Multi-view clustering is a data mining task in which a data set is processed
by several algorithms observing different features of the same data. The main difficulty
of this task is to detect whether or not sharing information between the views
may be beneficial: some views contain mostly noisy features, while others simply
contain features which lead to different clusters. One of the challenges of multi-view
clustering is therefore to find which views should work together or not. Within this
context, in this article we propose an optimisation method which sets the exchange
weights between the different algorithms based on the maximization of the global
likelihood function.

Abstract Il clustering multi vista è una tecnica di data mining in cui un insieme
di dati viene elaborato da diversi algoritmi analizzando diverse caratteristiche (o
“viste”) degli stessi dati. La difficoltà principale di questa tecnica è la capacità
di rilevare quali informazioni possono essere utili da condividere tra i diversi al- 
goritmi: Alcune viste contengono per lo più rumore, mentre altre contengono sem- 
plicemente le caratteristiche specifiche dei diversi cluster. Una delle sfide del clus- 
tering multi-vista è, quindi, quella di trovare quali viste dovrebbero essere messe in 
comune e quali no. Il presente articolo propone un metodo di ottimizzazione basato 
sulla massimizzazione della funzione di verosimiglianza globale, ottenuto asseg- 
nando il peso di scambio tra i vari algoritmi. 


Key words: Multi-view clustering, optimization 


1 Introduction 


Data clustering is an exploratory data mining task that aims at discovering hidden 
intrinsic structures in a data set by forming groups (clusters) of objects that share 
similar features. Data clustering is usually considered more difficult than supervised 
classification because of its exploratory nature which makes the results difficult to 
rate. Nowadays, with data being abundant, most data sets are available under several
different representations. This gave birth to the field of multi-view learning, a recent
paradigm which consists in learning and analyzing data using several views
with redundant features. While this increased amount of information has proved
very beneficial in the context of supervised learning, multi-view clustering remains
problematic: since it is difficult to assess the quality of a clustering
result in an unsupervised context, it is equally difficult to rate the quality of a view,
and therefore to know whether an exchange of information between different
views will be beneficial or detrimental.

Several unsupervised multi-view methods are available in the literature [13, 11,
2] with applications ranging from the clustering of distributed data [9, 3] to collaborative
clustering [12, 6]. All of these models stress the
importance of properly weighting the exchange links between the different views in
order to avoid “negative collaborations” [4, 10, 6, 5]. These so-called negative
collaborations may arise for different reasons, such as the lack of common structures
in the different views, or too many noisy features in some of them.

Within this context, the aim of this article is to propose an optimization method 
to detect which views should or should not exchange their information with the goal 
of minimizing the risk of negative collaborations. 


2 General form for multi-view clustering likelihood functions 


Let us define $J$ algorithms $\{\mathcal{A}^1, \dots, \mathcal{A}^J\}$, with each of them processing a different
view among $\{X^1, \dots, X^J\}$. These views describe the same $N$ elements with different
real features that may be redundant from one view to another: $\forall i,\ X^i \subseteq X$,
$X^i = \{x^i_1, \dots, x^i_N\}$, $x^i_n \in \mathbb{R}^{d_i}$. From there, each algorithm $\mathcal{A}^i$ tries to find a local solution
$S^i$ (hard or fuzzy) based on explicit or implicit distribution parameters $\Theta^i$ and using
information exchanged with the other views. Each algorithm is therefore defined by
its subset, its solution and its distribution parameters. To this simple model, we add
the exchange weights $\tau_{j,i}$ that define how much information will be transferred from
algorithm $\mathcal{A}^j$ to algorithm $\mathcal{A}^i$. Therefore, we have: $\mathcal{A}^i = \{X^i, S^i, \Theta^i, \tau_{j,i}\}$.
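To make the components of this model concrete, the following minimal sketch stores them in a small container. The class name, the field names and the toy values are illustrative assumptions and do not come from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ViewAlgorithm:
    """Container mirroring one algorithm: its view, solution, parameters and weights."""
    data: List[List[float]]                 # X^i: N rows of d_i real features
    solution: List[int]                     # S^i: here a hard partition, one label per element
    params: Dict[str, List[float]] = field(default_factory=dict)  # Theta^i
    weights_in: Dict[int, float] = field(default_factory=dict)    # tau_{j,i}, keyed by sender j


# Two views of the same N = 3 elements, with different feature dimensions.
view_0 = ViewAlgorithm(data=[[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]],
                       solution=[0, 1, 0], weights_in={1: 1.0})
view_1 = ViewAlgorithm(data=[[5.0], [1.0], [4.5]],
                       solution=[0, 1, 0], weights_in={0: 1.0})
```

Keying the weights by the sending view keeps the direction of each exchange explicit, which matters because the weights need not be symmetric.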

Using this model, most multi-view models found in the literature [11, 2, 9, 12,
6] try to optimize a variant of the fitness function given in Equation (1), where
$\mathcal{L}(X^i, S^i, \Theta^i)$ is a local term usually derived from a log-likelihood or a quality index
specific to each algorithm $\mathcal{A}^i$, and $\Delta_{i,j}$ is a divergence term assessing the pairwise
difference or divergence between the models or partitions of two algorithms $\mathcal{A}^i$ and
$\mathcal{A}^j$.

$$\{S_{opt}, \Theta_{opt}\} = \operatorname*{argmax}_{S,\Theta} \sum_{i=1}^{J} \left( \mathcal{L}(X^i, S^i, \Theta^i) - \sum_{j \neq i} \tau_{j,i}\, \Delta_{i,j} \right) \tag{1}$$

Depending on the multi-view framework, the divergence term $\Delta_{i,j}$ may be based
on the partitions [12]: $\Delta_{i,j} = \Delta(S^i, S^j)$, or on the distribution parameters and
prototypes: $\Delta_{i,j} = \Delta(\Theta^i, \Theta^j)$; and may regardless be concretely computed as an entropy,
a Kullback-Leibler divergence, or a custom distance function between prototypes.
Please note that depending on the function used, the equality $\Delta_{i,j} = \Delta_{j,i}$ is not always
true.
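The fitness function of Equation (1) can be evaluated with a few lines of code once the local terms, the weights and the divergences are known. The sketch below is only an illustration: the function name, the matrix layout (`tau[j][i]` for the weight from view j to view i, `delta[i][j]` for the divergence seen from view i) and the numbers are assumptions.

```python
def fitness(local_terms, tau, delta):
    """Sum over views of the local term minus the weighted divergences.

    local_terms[i]: local term of view i (e.g. a log-likelihood).
    tau[j][i]:      exchange weight from view j to view i.
    delta[i][j]:    divergence between views i and j, as seen from view i.
    """
    n_views = len(local_terms)
    total = 0.0
    for i in range(n_views):
        penalty = sum(tau[j][i] * delta[i][j]
                      for j in range(n_views) if j != i)
        total += local_terms[i] - penalty
    return total


# Toy example with three views and uniform weights (hypothetical values).
local_terms = [-10.0, -12.0, -9.0]
tau = [[0.0, 0.5, 0.5],
       [0.5, 0.0, 0.5],
       [0.5, 0.5, 0.0]]
delta = [[0.0, 0.2, 0.8],
         [0.2, 0.0, 0.6],
         [0.8, 0.6, 0.0]]
value = fitness(local_terms, tau, delta)
```

In a full multi-view method the local terms and divergences change at every iteration; here they are fixed so that only the role of the weights is visible.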


As stated in the introduction, in this article we will mostly be interested in the
exchange weights $\tau_{j,i}$ and how to set them up to ensure the best possible results.
The available literature on this subject is rather slim when it comes to unsupervised
multi-view or collaborative learning: in some works, the problem is simply ignored
and the weights are set to 1 by default [2, 12], while other works describe setting up the
weights manually [6]. Finally, in the work closest to this article, a method applicable
to several collaborative clustering algorithms is proposed that relies on the quality and
diversity of the partitions to set up the weights, based on a regression computed on
a point cloud [10].

The weakness of this latter approach is that it relies on internal clustering indexes
that may be biased toward certain views or algorithms depending on the index and
distance function used. Furthermore, the specific adjustment of the parameters depends
on the data themselves and is subject to trial and error. To solve this problem,
in the next section we propose a mathematically sound approach relying
solely on the fitness function described in Equation (1) to optimize the exchange
links. Our proposed approach therefore has the advantage of being more generic and
unbiased.


3 Optimization under KKT conditions 


The method that we propose consists in optimizing Equation (1) under the
Karush-Kuhn-Tucker (KKT) conditions [7], with the goal of finding the ideal $\tau_{j,i}$. In a second
step, we will interpret the expression found with the aforementioned method.

We first make the hypothesis that all divergence terms $\Delta_{i,j}$ are normalized
between 0 and 1. Therefore, for any divergence term, there is an opposite consensus
term $C_{i,j}$ such that $\Delta_{i,j} = 1 - C_{i,j}$. If we inject this term into Equation (1), we get
Equation (2) below.


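$$\{S_{opt}, \Theta_{opt}\} = \operatorname*{argmax}_{S,\Theta} \sum_{i=1}^{J} \left( \mathcal{L}(X^i, S^i, \Theta^i) - \sum_{j \neq i} \tau_{j,i} + \sum_{j \neq i} \tau_{j,i}\, C_{i,j} \right) \tag{2}$$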
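The substitution behind Equation (2) can be written out term by term. The block below only restates that step for a single view $i$, under the stated hypothesis $\Delta_{i,j} = 1 - C_{i,j}$:

```latex
\begin{align*}
\mathcal{L}(X^i,S^i,\Theta^i) - \sum_{j \neq i} \tau_{j,i}\,\Delta_{i,j}
  &= \mathcal{L}(X^i,S^i,\Theta^i) - \sum_{j \neq i} \tau_{j,i}\,(1 - C_{i,j}) \\
  &= \mathcal{L}(X^i,S^i,\Theta^i) - \sum_{j \neq i} \tau_{j,i}
     + \sum_{j \neq i} \tau_{j,i}\,C_{i,j}.
\end{align*}
```

If the weights received by each view are constrained to sum to one, the middle sum equals 1 for every view and contributes the constant $-J$ to the objective, which is why it plays no role in choosing the weights.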
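From there, finding the $\tau_{j,i}$ that maximize this equation with the constraint
$\sum_{j \neq i} \tau_{j,i} = 1$ gives us the following system:

$$\begin{cases} T = \operatorname*{argmax}_{\tau} \sum_{i=1}^{J} \sum_{j \neq i} \tau_{j,i}\, C_{i,j} \\ \forall i,\ \sum_{j \neq i} \tau_{j,i} = 1 \\ \forall (i,j),\ \tau_{j,i} \geq 0 \end{cases} \tag{3}$$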

Note that the middle term from Equation (2) has been discarded because it does
not depend on the data or partitions, and is constant under the second constraint on
the weights. In the same way, the first term does not contain any $\tau_{j,i}$ and is therefore
also removed from the problem. From this system, by using the Lagrange multipliers
we get the following KKT conditions:


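$$\forall (i,j),\ i \neq j: \quad \begin{cases} (a)\ \tau_{j,i} \geq 0 \\ (b)\ \sum_{j \neq i} \tau_{j,i} = 1 \\ (c)\ \lambda_{j,i} \geq 0 \\ (d)\ \tau_{j,i}\, \lambda_{j,i} = 0 \\ (e)\ C_{i,j} - \lambda_{j,i} - \nu_i = 0 \end{cases} \tag{4}$$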


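With (c) and (e), we have:

$$\lambda_{j,i} = C_{i,j} - \nu_i \geq 0 \tag{5}$$

Let us suppose that there is a $k$ such that $\tau_{k,i} > 0$. Then with (d), we have: $\lambda_{k,i} = 0$.
And with Equation (5), we have: $\nu_i = C_{i,k}$. Then, using this information we can say
that:

$$\forall j \neq k: \quad \begin{cases} \tau_{j,i} \neq 0 \;\Rightarrow\; C_{i,j} = C_{i,k}, \quad \lambda_{j,i} = 0 \\ \tau_{j,i} = 0 \;\Rightarrow\; \lambda_{j,i} = C_{i,j} - C_{i,k} \geq 0 \end{cases} \tag{6}$$

From the second line of Equation (6), we can conclude the following:

$$\tau_{j,i} \neq 0 \;\Rightarrow\; C_{i,j} = \max_{k} C_{i,k} \tag{7}$$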


Then, if we use (d) and (e), we have:

$$\tau_{j,i}\,(C_{i,j} - \nu_i) = 0 \tag{8}$$

If we sum Equation (8) over $j$ and use (b), we have:

$$\nu_i = \sum_{j \neq i}^{J} \tau_{j,i}\, C_{i,j} \tag{9}$$


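For Equation (9) to be correct while respecting the constraints given in Equations
(6) and (7), the only solution is:

$$\forall (i,j),\ i \neq j: \quad \tau_{j,i} = \begin{cases} \dfrac{1}{\mathrm{Card}\left(\operatorname{argmax}_{k} C_{i,k}\right)} & \text{if } C_{i,j} = \max_{k}(C_{i,k}) \\ 0 & \text{otherwise} \end{cases} \tag{10}$$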
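The constrained solution can be illustrated numerically. The sketch below assigns, for each receiving view, all of the weight to the view or views with the highest consensus, splitting it equally in case of ties. The function names, the matrix layout (`cons[i][j]` for the consensus, `tau[j][i]` for the weight from view j to view i) and the toy values are assumptions; ties are detected by exact equality, which a real implementation would probably replace with a tolerance.

```python
def hard_weights(cons):
    """All weight goes to the most similar view(s), shared equally among ties."""
    n_views = len(cons)
    tau = [[0.0] * n_views for _ in range(n_views)]
    for i in range(n_views):
        others = [j for j in range(n_views) if j != i]
        best = max(cons[i][j] for j in others)
        winners = [j for j in others if cons[i][j] == best]
        for j in winners:
            tau[j][i] = 1.0 / len(winners)
    return tau


def consensus_objective(tau, cons):
    """Objective of the weight problem: sum over i and j != i of tau[j][i] * cons[i][j]."""
    n_views = len(cons)
    return sum(tau[j][i] * cons[i][j]
               for i in range(n_views) for j in range(n_views) if j != i)


# Toy consensus matrix: view 2 is equally close to views 0 and 1.
cons = [[1.0, 0.9, 0.2],
        [0.9, 1.0, 0.4],
        [0.3, 0.3, 1.0]]
tau = hard_weights(cons)
uniform = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
```

On this example the hard weights reach a higher objective value than uniform weights, as expected for a linear objective maximized over a simplex.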


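It is possible to get a relaxed version of this result by modifying the system from
Equation (3) as follows:

$$\begin{cases} T = \operatorname*{argmax}_{\tau} \sum_{i=1}^{J} \sum_{j \neq i} \tau_{j,i}\, C_{i,j} \\ \forall i,\ \sum_{j \neq i} (\tau_{j,i})^{p} = 1,\quad p \in \mathbb{N},\ p > 1 \\ \forall (i,j),\ \tau_{j,i} \geq 0 \end{cases} \tag{11}$$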
It is worth noting that using this new system does not exactly optimize Equation
(2), since the middle term cannot be ignored under the new constraints. Instead, we
are trying to optimize Equation (12), which can be seen as a “conjugate problem in
spirit” since we want to favor the consensus instead of penalizing the divergences
between models.

$$\{S_{opt}, \Theta_{opt}\} = \operatorname*{argmax}_{S,\Theta} \sum_{i=1}^{J} \left( \mathcal{L}(X^i, S^i, \Theta^i) + \sum_{j \neq i} \tau_{j,i}\, C_{i,j} \right) \tag{12}$$

However, mathematically speaking the $\tau_{j,i}$ of this system are not the same ones
as in Equations (1) and (2). For the sake of completeness we will compute them and
show that they can still be used in the original equation. With this system, we get
the following new KKT conditions:


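$$\forall (i,j),\ i \neq j: \quad \begin{cases} (a)\ \tau_{j,i} \geq 0 \\ (b)\ \sum_{j \neq i} (\tau_{j,i})^{p} = 1,\quad p > 1 \\ (c)\ \lambda_{j,i} \geq 0 \\ (d)\ \tau_{j,i}\, \lambda_{j,i} = 0 \\ (e)\ C_{i,j} - \lambda_{j,i} - \nu_i \left( p \cdot \tau_{j,i}^{\,p-1} \right) = 0 \end{cases} \tag{13}$$

If we consider the case $\tau_{j,i} \neq 0$ and $\lambda_{j,i} = 0$ in Equation (d), then with (e) we
have: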


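$$\tau_{j,i} = \left( \frac{C_{i,j}}{p \cdot \nu_i} \right)^{\frac{1}{p-1}} \tag{14}$$

From Equation (14) and (b), we have:

$$1 = (p \cdot \nu_i)^{\frac{-p}{p-1}} \sum_{j \neq i} (C_{i,j})^{\frac{p}{p-1}} = (\nu_i)^{\frac{-p}{p-1}} \sum_{j \neq i} \left( \frac{C_{i,j}}{p} \right)^{\frac{p}{p-1}} \tag{15}$$

$$\nu_i = \left( \sum_{j \neq i} \left( \frac{C_{i,j}}{p} \right)^{\frac{p}{p-1}} \right)^{\frac{p-1}{p}} = \frac{1}{p} \left( \sum_{j \neq i} (C_{i,j})^{\frac{p}{p-1}} \right)^{\frac{p-1}{p}} \tag{16}$$

Then by injecting the expression of $\nu_i$ into Equation (14), $\forall (i,j)$, $i \neq j$, $p > 1$ we
have:

$$\tau_{j,i} = \frac{(C_{i,j})^{\frac{1}{p-1}}}{\left( \sum_{k \neq i} (C_{i,k})^{\frac{p}{p-1}} \right)^{\frac{1}{p}}} \tag{17}$$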
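The relaxed weights obtained by combining Equations (14) and (16) can be checked numerically. The sketch below computes, for each receiving view $i$, the weight $C_{i,j}^{1/(p-1)}$ divided by the $p$-th root of $\sum_{k \neq i} C_{i,k}^{p/(p-1)}$. The function name, the matrix layout and the toy values are assumptions, and the consensus terms are assumed to be non-negative and not all zero for any view.

```python
def relaxed_weights(cons, p):
    """Relaxed exchange weights for p > 1, one column of weights per receiving view.

    cons[i][j]: consensus between views i and j, assumed to lie in [0, 1].
    Returns tau with tau[j][i] the weight from view j to view i.
    """
    if p <= 1:
        raise ValueError("p must be strictly greater than 1")
    n_views = len(cons)
    tau = [[0.0] * n_views for _ in range(n_views)]
    for i in range(n_views):
        others = [j for j in range(n_views) if j != i]
        norm = sum(cons[i][k] ** (p / (p - 1.0)) for k in others) ** (1.0 / p)
        for j in others:
            tau[j][i] = cons[i][j] ** (1.0 / (p - 1.0)) / norm
    return tau


cons = [[1.0, 0.9, 0.2],
        [0.9, 1.0, 0.4],
        [0.3, 0.3, 1.0]]
tau_2 = relaxed_weights(cons, p=2)        # weights proportional to the consensus
tau_1000 = relaxed_weights(cons, p=1000)  # weights are almost equal
```

With $p = 2$ the weights of each view are simply its consensus values divided by their Euclidean norm, while a large $p$ flattens them, in line with the discussion of the role of $p$.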
4 Interpretation


In this section, we will interpret and explain the results from Equations (10) and 
(17). 

We begin by analyzing the first result, which basically says that in the most
constrained case, each view of a multi-view framework should only acquire information
from the view that has the closest model and not at all from the other views. The
model specifies that if several views have equally close models, information
should be acquired from all of them with an equal exchange link. We can reasonably
say that limiting exchanges mostly to pairs of the most similar views is very restrictive,
hence our proposal of a relaxed version of the weights.

In Equation (17), we have a second result based on the conjugate problem, which
favors the consensus rather than penalizing the differences. The idea is that while
the exchange link should still be stronger between views that have similar
models, depending on the value of $p$ the information coming from divergent models
should be given some importance too, based on their degree of divergence. In fact,
when $p$ grows towards infinity all weights become equal.


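• For $p > 1$:

$$\forall j \neq i,\quad \tau_{j,i} = \frac{|C_{i,j}|^{\frac{1}{p-1}}}{\left( \sum_{k \neq i} |C_{i,k}|^{\frac{p}{p-1}} \right)^{\frac{1}{p}}} \tag{18}$$

• When $p \to \infty$:

$$\forall (i,j),\quad \tau_{j,i} = \text{constant} \tag{19}$$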
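The two limiting regimes of these weights can be worked out directly from the exponents. In the sketch below, $M_i = \max_{k \neq i} C_{i,k}$ and $m_i$, the number of views attaining this maximum, are notation introduced only for this illustration, and all consensus terms are assumed to be strictly positive.

```latex
\begin{align*}
p \to \infty:\quad
  & \tfrac{1}{p-1} \to 0,\quad \tfrac{p}{p-1} \to 1,\quad \tfrac{1}{p} \to 0
  \;\Longrightarrow\;
  \tau_{j,i} \to \frac{C_{i,j}^{\,0}}{\big(\sum_{k \neq i} C_{i,k}\big)^{0}} = 1
  \quad \text{for all } j \neq i, \\[4pt]
p \to 1^{+}:\quad
  & \tau_{j,i}
  = \frac{\left(C_{i,j}/M_i\right)^{\frac{1}{p-1}}}
         {\Big(\sum_{k \neq i} \left(C_{i,k}/M_i\right)^{\frac{p}{p-1}}\Big)^{\frac{1}{p}}}
  \;\longrightarrow\;
  \begin{cases} 1/m_i & \text{if } C_{i,j} = M_i \\ 0 & \text{otherwise.} \end{cases}
\end{align*}
```

The first line shows that the weights become equal for large $p$, and the second that the relaxed weights approach the constrained solution in which only the most similar views are used.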


While this interpretation seems to be in the continuity of the result for $p = 1$
in Equation (10), these weights were, as stated earlier, computed for the conjugate
problem shown in Equation (12), which is consensus based instead of divergence
based. We shall therefore show that these weights are still applicable and make
sense for the original model from Equation (1), which is the one used in most
multi-view algorithms.

Under the hypothesis that $0 \leq \Delta_{i,j} = 1 - C_{i,j} \leq 1$, when we inject these weights
into Equation (1), we have the following interpretation: the exchange link should be
stronger when acquiring information from views that have models with the lowest
divergences. Therefore the optimization process used to find the partitions and, in
fine, the model would have two interesting properties:


1. It would attempt to reduce the divergences in model between similar partitions. 
2. It would reduce the exchanges between too dissimilar partitions or models. 


The first property is interesting because it relates to the concept of stability
in clustering partitions [1, 8]. Indeed, by encouraging views with similar results 
to work together in an unsupervised context, it increases the chances of achieving 
better results, since structures found in several views can be considered stable which 
is a good and unbiased indication for exploratory data mining. 

The second and first properties together have direct applications to the
problem of feature selection. If we consider multi-view data where the same objects
are represented using different redundant views, we know that the two main problems
are views that contain mostly noise, and views or groups of views that contain
different and incompatible data structures. Using our proposed weighting system,
the noisy views would not hinder the others by sending wrong information, because
they are too different and will have low exchange links toward every other view.
As for groups of views containing different structures, our weighting system would
mostly foster exchanges between similar views, thus creating meta-clusters of similar
views. These meta-clusters of views would still exchange information through
the views that are compatible with several meta-clusters, and noisy views would
remain isolated as outliers.

Using the graph of the weights $\tau_{j,i}$, it would be very simple to detect views
containing mostly noisy features, but also to remove redundant attributes by detecting
communities of hyper-connected views forming clusters in the graph and removing
some of them.
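As a minimal sketch of this idea, the snippet below flags views whose average exchange weight with all other views falls under a threshold. The function name, the toy weight matrix and the threshold value are illustrative assumptions and are not taken from the method itself.

```python
def isolated_views(weights, threshold):
    """Return indices of views whose mean exchange weight with the other
    views is below `threshold` (hypothetical helper, toy criterion)."""
    n = len(weights)
    flagged = []
    for i in range(n):
        # Average the two directions of each exchange link involving view i.
        links = [(weights[i][j] + weights[j][i]) / 2 for j in range(n) if j != i]
        if sum(links) / len(links) < threshold:
            flagged.append(i)
    return flagged

# Toy weights (assumed values): views 0-2 agree, view 3 is mostly noise.
tau = [
    [0.0, 0.8, 0.7, 0.1],
    [0.8, 0.0, 0.9, 0.1],
    [0.7, 0.9, 0.0, 0.2],
    [0.1, 0.1, 0.2, 0.0],
]
noisy = isolated_views(tau, threshold=0.3)
```

Detecting communities of hyper-connected views would require a graph clustering step on the same matrix, which is left out of this sketch.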


If we had attempted the same optimization based on the original Equation (1)
using $A_{i,j}$ instead of $C_{i,j}$, it would have led to p = 2 being the only valid
constraint on the weights, and to the solution given in Equation (20).


$$\tau_{j,i} = -\frac{A_{i,j}}{\sqrt{\sum_{k \neq i} A_{i,k}^{2}}} \qquad (20)$$


While this result is mathematically correct, it leads to negative weights, the absolute
value of which is highest between views that have the most divergent models.
Compared with the result from Equation (17), this solution offers little interest from
a multi-view perspective because a negative exchange of information cannot practically
be translated into a multi-view framework. Furthermore, this model is difficult to
interpret and leads to weak exchange weights between similar and mildly
similar solutions. This second point is also problematic from a multi-view perspective,
where the views are supposed to exchange with a goal of mutual improvement.


5 Conclusion 


In this article, we have proposed an optimization method to set up the exchange 
weights between views in the context of unsupervised multi-view clustering. Our 
method relies on the optimization of a conjugate fitness function under the Karush- 
Kuhn-Tucker conditions and provides an analytic expression for the weights. 

Our results are interesting in the sense that the resulting weights rely on the
unbiased notion of clustering stability instead of regular clustering indexes, and they
are applicable to most multi-view frameworks, unlike other weighting methods in the
literature. Furthermore, our method creates meta-clusters of views, which makes it
possible to regroup views containing similar features that lead to different structures,
thus enabling the detection of redundant attributes. More importantly, it makes
it easy to detect views with mostly noisy features.

Since this work is only a preliminary theoretical background, in future work
we look forward to applying it to various multi-view methods and seeing how it fares in
practice.

Social Sensing and Official Statistics: call data
records and social media sentiment analysis


Social Sensing e Official Statistics: analisi di dati 
telefonici e misure di sentiment dai social media 


Emilio Sulis 


Abstract This contribution explores the relationship between indicators from offi- 
cial statistics and measures derived from new data sources. In particular, phone call 
data from Italy’s leading phone company offer suggestions on patterns of behavior, 
as well as on the demographic composition of municipalities. A classification ex- 
periment based on ego-network measures reaches an F-measure accuracy of 0.68 
(baseline: 0.63) in distinguishing high and low presence of migrants in Italian
municipalities. In addition, this paper explores Sentiment Analysis to investigate the
content of online social media communications. Measures of people’s sentiment
from a microblogging platform show some correlations with economic data. These
results provide interesting arguments about the usefulness of integrating new kinds
of data to estimate subjective well-being and official statistics.

Abstract Il presente contributo affronta il rapporto tra indicatori provenienti dalla
statistica ufficiale e misure derivate da nuove fonti di dati digitali. In particolare,
l’esame di dati telefonici permette di rilevare informazioni sul comportamento delle
persone. Tali dati hanno dimostrato di essere utili anche per valutare la composizione
demografica dei comuni italiani. Un esperimento di classificazione basato
su misure di ego-network ha permesso di migliorare la precisione (F-measure 0.68,
baseline 0.63) nel distinguere tra comuni italiani con alta o bassa presenza di migranti.
Inoltre, è possibile esaminare il contenuto delle comunicazioni dei social media,
applicando tecniche di Sentiment Analysis. Le misure di polarità dei messaggi,
aggregate a livello provinciale, hanno mostrato una correlazione positiva con alcuni
indicatori sociali. Questi risultati offrono argomenti circa l’utilità di integrare
nuovi tipi di dati alle stime del subjective well-being e alla statistica ufficiale.

1 Introduction


Several studies combine computational methods with theories and techniques of
social sciences. Recently, Computational Social Science (CSS) has investigated
individuals and groups in order to understand social phenomena, organizations and
companies by exploiting so-called Big Data [6]. The data-driven computational analysis
of large datasets offers new types of insights complementary to classical methods
such as surveys, self-reported data and direct observations. In particular, some
information extraction techniques combine Social Network Analysis (SNA) with machine
learning algorithms. In studies of this kind, based on user-generated content on a
large scale, humans can be considered as social sensors [3, 4].
This contribution focuses on a set of experiments on new and traditional
data: Section 2 describes the data and methodology of two studies exploiting
Call Data Records (CDR). In Section 3, an experiment of text content analysis of
social media information sheds light on the relationship between social media and
social well-being. Finally, the last section contains some concluding remarks and
ideas for future work.


2 Exploiting phone calls to assess behavioral patterns, 
demographic and immigration data 


This section details two different kinds of experiment based on mobile phone
call datasets from Italy’s leading telecommunication company, “Telecom Italia” (see
footnote 1). Experiment 1 explores SNA measures from a communication network of one
week in Italy to give an insight into patterns of behavior and demographics. Experiment 2
concerns Ego-Network analysis investigating immigration data. In the following, we
briefly describe the related dataset, methodology and main results for each experiment.


2.1 Patterns of behavior and demography 


Data and methodology. In Experiment 1 the dataset D1 includes about 300 million
mobile phone calls made in Italy during a representative week of the year. The
network of phone calls takes the geographical references of antennas as nodes,
and the sum of the durations of the single calls as the weight of the edges. In
addition, we also use measures of population, real estate market and enterprises (see
footnote 2).

Calling patterns. Dataset D1 has face validity: as Figure 1 shows, the
distribution over time of phone call data exhibits a weekly cycle and a daily cycle.
In particular, the distribution of each day involves two daily peaks, one in the
late morning and another one in the afternoon. As expected, Sunday presents slight
differences: first, we noticed a lower number of calls; moreover, the morning peak
comes later than on the other days (at around 11, instead of 10), consistently with the
sleep and wake cycles already detected in similar studies.

Fig. 1 Frequency of Italian mobile phone calls in one week using Dataset D1.

1 Thanks to Trusted Digital Life Innovation group of Telecom Italia SpA, Turin.

2 Our data sources are ISTAT, the Italian Revenue Agency (Agenzia delle Entrate), and the
Chambers of Commerce (CCIAA).

Degree and centrality measures. In the network of phone calls from D1, arcs with
higher weights connect the most relevant nodes. As degree indicates the number of
ingoing and outgoing phone calls for each node, the lowest degree corresponds to a
site in an Alpine mountain valley (Alta Valsesia), while the highest degree corresponds
to an antenna of the “Termini” railway station in Rome. Centrality measures offer
important suggestions about the connectivity and the position of nodes in the graph.
Regarding betweenness, the highest value corresponds to a CDR in
the Italian capital, consistently with the central geographical position of the city.
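To make the network construction concrete, the sketch below computes the degree (number of links) and the strength (total call duration) of each node from a weighted edge list. The antenna identifiers and durations are invented toy values, and betweenness is deliberately left out to keep the example short.

```python
from collections import defaultdict

def degree_and_strength(edges):
    """edges: list of (node_a, node_b, duration) tuples.
    Returns two dicts: number of links and summed duration per node."""
    degree = defaultdict(int)
    strength = defaultdict(float)
    for node_a, node_b, duration in edges:
        for node in (node_a, node_b):
            degree[node] += 1
            strength[node] += duration
    return dict(degree), dict(strength)

# Toy network (assumed data): four antennas, durations in seconds.
calls = [("A", "B", 120.0), ("A", "C", 30.0), ("B", "C", 45.0), ("A", "D", 10.0)]
degree, strength = degree_and_strength(calls)
busiest = max(degree, key=degree.get)
```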

Demographics and network measures. Looking at population by age, the duration
of phone calls highly correlates with the presence of young people (r = 0.90),
while the values are lower for adults (r = 0.87) and the elderly (r = 0.79). Similarly,
phone calls negatively correlate both with the aging index (r = -0.35) and with the
incidence of the elderly (r = -0.36). In addition, network measures correlate with the
number of enterprises (r = 0.78), which is an indicator of the economic well-being of a
province. A weaker relationship is observed with respect to the number of residential
property sales (r = 0.48). These results support the usefulness of phone call datasets and
open the way to further investigation of this subject, as detailed in the next section.
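The r values above are Pearson correlation coefficients. A minimal sketch of the computation follows; the two input series are toy values chosen only to illustrate the call, not data from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy series (assumed): call duration and number of young residents per province.
duration = [1.0, 2.0, 3.0, 4.0]
young = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(duration, young)
```

The function raises a division error when one of the series is constant, which a production version would need to handle.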


2.2 Call data records and immigration 


Data and methodology. To investigate international phone calls, dataset D2 refers
to calls made on a non-working day. This dataset includes about 67 million single
calls, with information about the duration and the direction (incoming or
outgoing), aggregated on the single cell by latitude and longitude. The
attention here is on the Ego-Network (EN), where Ego nodes are the Italian municipalities
and Alter nodes are the foreign states. Arcs represent the communication links,
weighted on the basis of call duration. The representation of a first-order EN is a
star, where an Italian municipality is Ego, while a second-order EN also includes the
arcs going from the related foreign states to other Italian municipalities (see footnote
3). This kind of EN takes into account the impact of foreign states on a municipality
and on Italy as a whole.

In order to obtain a single measure concerning migration movements in one year,
a Demographic Index of Migrations (DIM) is computed for each of the about 8,000 Italian
municipalities as the combination of four single indicators: i. one-year variation
of immigrant residents; iii. register office movements with foreign states; iv. net
migration rate. Each indicator is normalized between 0 and 1 and the sum of the four
is the value of the index. According to the DIM, municipalities with values over the
mean are labeled HighDIM, while LowDIM includes the lower ones.
A binary classification experiment uses a set of 37 features from phone call
ego-networks (e.g., degree, strength, constraint, centrality) and different machine
learning algorithms in Weka (see footnote 4).
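The construction of the index can be sketched as follows, assuming min-max scaling for the normalization between 0 and 1. The indicator names and values are placeholders, and only two toy indicators are used instead of the four of the actual index.

```python
def min_max(values):
    """Rescale a list of numbers to the [0, 1] interval."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def dim_labels(indicators):
    """indicators: dict mapping an indicator name to one raw value per
    municipality. Returns the index values and the HighDIM/LowDIM labels."""
    normalized = [min_max(values) for values in indicators.values()]
    dim = [sum(column) for column in zip(*normalized)]
    mean_dim = sum(dim) / len(dim)
    labels = ["HighDIM" if value > mean_dim else "LowDIM" for value in dim]
    return dim, labels

# Toy data (assumed): two placeholder indicators for three municipalities.
raw = {"indicator_a": [0.0, 5.0, 10.0], "indicator_b": [1.0, 2.0, 3.0]}
dim, labels = dim_labels(raw)
```

Min-max scaling fails when an indicator is constant across municipalities, a case the sketch does not cover.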

Classification results. As values for smaller municipalities are less reliable, we
finally considered the 538 municipalities with more than 10k families. The
classification task then includes 254 municipalities with LowDIM and 282 with HighDIM.
The baseline measures were computed on the basis of the number of foreign states
linked to each municipality (63.6%) and the strength of the network (62.8%). Adopting
Logistic Regression as the machine learning algorithm, the confusion matrix
shows good results, with a higher number of correct predictions for the LowDIM class
(see Table 1). As the classification obtains an F-measure of 67.9%, we
conclude that network measures derived from phone calls are useful.


          LowDIM  HighDIM
LowDIM       201       81
HighDIM       91      163


Table 1 Confusion matrix for high and low DIM. The LowDIM class obtains the best results. 
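The reported F-measure can be checked against the confusion matrix with the sketch below. It assumes that rows are actual classes and columns are predicted classes, and that the reported score is the average of the per-class F-measures weighted by class size; neither assumption is stated in the text.

```python
def weighted_f_measure(matrix):
    """matrix[i][j]: number of instances of actual class i predicted as
    class j (assumed layout). Returns the class-size weighted F-measure."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    score = 0.0
    for c in range(n_classes):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(n_classes)) - tp
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += f1 * sum(matrix[c]) / total
    return score

confusion = [[201, 81], [91, 163]]  # values of Table 1
f_measure = weighted_f_measure(confusion)
accuracy = (201 + 163) / (201 + 81 + 91 + 163)
```

Under these assumptions both the weighted F-measure and the accuracy round to 0.68.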


Computing the Information Gain for each feature, the most relevant ones for this
classification task are network metrics. In particular, these features are the duration
of calls (or the sum of the weights) in the complete second-order EN, and the
number of arcs of the second-order EN. Comparing the results with our baseline, and
consistently with the role of social networks in migrations, we confirm the relevance of
phone call ego-network measures in this prediction task.


3 As this dataset comes from a single Italian company, it does not consider the links between
Alter nodes (calls between foreign states, e.g. from Spain to France), nor data from other
phone companies.

4 We adopt Naive Bayes, Support Vector Machines, as well as Logistic Regression in the Weka
toolkit: http://www.cs.waikato.ac.nz/ml/weka/

3 Sentiment Analysis of Social Media messages


The content of communication can be investigated by a lexicon-based approach that
counts the polarity of each word in a sentence [13]. Dictionaries containing the
semantic orientation of terms have been developed and used in different tasks, such as
distinguishing emotions [7] or detecting sarcasm [11]. Machine learning techniques
automatically classify labeled instances of texts or sentences [8] in supervised
classification tasks. Furthermore, an NLP approach has already been used to detect
happiness in Italian Twitter messages [2]. The current section introduces a general
framework architecture, implemented on a large corpus of tweets.


3.1 A general framework 


A whole framework architecture to manage social media data and official statistics,
detailed in [12], consists of five main parts: providers, data gathering, data
analysis and visualization. Providers are the data sources (e.g. Twitter, as well as
demographic data). A data gathering module manages three particular tasks: first,
a submodule collects information from different providers; then, a filtering step
removes noisy data (e.g. duplicate records, empty entries, formatting errors); finally,
results are standardized in a unified format. Then, a data analysis module combines
several data processing steps: first of all, a sentiment analysis submodule returns a
mood value (positive, negative, neutral) for each geolocated tweet; a mash-up submodule
aggregates results by regions, provinces and municipalities; finally, data are stored in a
database to be further processed for elaboration and visualization.
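The mash-up step of the data analysis module can be pictured with the small sketch below, which counts mood values per territorial unit. The function name and the input format (pairs of unit and mood) are assumptions made for illustration and do not describe the framework detailed in [12].

```python
from collections import Counter, defaultdict

def aggregate_moods(tweets):
    """tweets: iterable of (territorial_unit, mood) pairs, where mood is
    "positive", "negative" or "neutral". Returns mood counts per unit."""
    counts = defaultdict(Counter)
    for unit, mood in tweets:
        counts[unit][mood] += 1
    return {unit: dict(moods) for unit, moods in counts.items()}

# Toy input (assumed): mood values already assigned to geolocated tweets.
sample = [
    ("Torino", "positive"),
    ("Torino", "negative"),
    ("Torino", "positive"),
    ("Roma", "neutral"),
]
by_province = aggregate_moods(sample)
```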


3.2 Sentiment analysis and social indicators 


On the basis of the general framework presented in the previous section, data
concerning moods and social indicators can be compared, as they belong to the same period
and to the same administrative level. A correlation analysis between moods and, in
turn, several social indicators is performed, in order to quantify the strength of the
relationship between the variables.

Data and methodology. The current study considers a large dataset (see footnote 5),
which includes 259,886,462 Italian tweets from January to December 2014. Of these,
4,686,251 tweets have geo-location information (coordinates). The line plots in Figure 2
show the monthly trends, both for georeferenced tweets (line plot a) and for all
tweets (line plot b). As expected, the distribution is well balanced across months,
with the exception of August, which corresponds to the holiday period in Italy.


5 Cf. TWITA project [1].

Fig. 2 Line plots of the monthly trend of georeferenced tweets (a) and all tweets (b) in Italy,
year 2014.


Data of tweets aggregated at the level of the Italian regions are described on
the left of Figure 3. The number of tweets by region is consistent with the size of
the population, as presented in the plot on the right. The highest number of tweets
comes from Lombardia, the most important industrialized area in Italy.

Fig. 3 The distributions of tweets (left) and of the Italian population (right) by region,
year 2014 (source: ISTAT).


Once the validity of the dataset has been confirmed, the polarity of messages is computed
by applying a lexicon-based approach. Polarity is computed with the formula presented
by Kramer in [5] and also used by Quercia [10], which is a normalized count
of the occurrences of positive and negative terms from the lexical resource LIWC (see
footnote 6). For n tweets including a number of positive (Pos) and negative (Neg) terms
collected in one month in a territorial unit having a certain mean (MeanPos and MeanNeg)
and standard deviation (SDPos and SDNeg) for positive and negative contents, the
polarity p is:


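$$p = \sum_{i=1}^{n} \left( \frac{\mathrm{Pos}_i - \mathrm{MeanPos}}{\mathrm{SDPos}} - \frac{\mathrm{Neg}_i - \mathrm{MeanNeg}}{\mathrm{SDNeg}} \right)$$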

6 LIWC is presented in [9]. We adopt our own version, translated into Italian.

A single polarity value is computed for each month and for each Italian province
in 2014, obtaining 1,320 distinct values (12 months for 110 provinces). The five
most positive provinces are Napoli, Milano, Savona, Torino and Cremona, while
the five most negative ones are Roma, Pordenone, Forlì, Udine and Lodi.

Correlation between sentiment and socio-economic data. The aggregated results
for each month in 2014 and the 110 Italian provinces are first correlated with
the number of hours in which workers were laid off while collecting unemployment
benefits, and with demographic data concerning the number of births. These two are
fine-grained measures available in Italy at the level of both provinces and months.
Other data consider all provinces for the whole year, ranging from unemployment to 
bank deposits. The most promising Pearson’s correlation results between sentiment 
values and socio-economic data (p-value < 0.05) are number of active companies 
(0.50), real estate transactions (0.38) and bank deposits (0.33). While a weak
correlation is observed with respect to population (0.24), registered companies rate
(0.21), active companies rate (0.20), births (0.15) and bank loans (0.12), some
indicators are not correlated at all, such as the employment or the activity rates.
Finally, the correlation between sentiment and Social Security measures is slightly
negative, even if very weak. In this experiment, the main idea was to analyze social
media content in comparison with a more extensive set of data coming from official
statistics. Even considering the problem of selection bias in non-representative
samples [15], some results seem quite promising, while others do not, similarly to
Wang [14], where social media sentiment overall seems weakly correlated with
official statistics. Nevertheless, the focus here was on describing the situation in
fine-grained temporal and geographical subdivisions.


4 Concluding remarks 


This contribution assesses the utility of digital traces in relation to official statistics.
Social signals from phone calls gave interesting insights into socio-economic indicators,
and Ego-Network measures proved useful in distinguishing high from low presence
of immigrants. Future work will extend this analysis to a wider time range,
including a larger phone call dataset covering one month. In
addition, we plan to extend the number of network features, e.g. including diversity.
By considering daytime phone calls on working days, we will test the hypothesis that
more industrialized provinces have wider network connections with richer countries.
Text content analysis offers different suggestions. A general framework to compare
socio-economic data and social media content is presented. A more detailed analysis
would include filtering out messages concerning different topics (sports,
television etc.) from the social media corpus. We suppose this would improve the
accuracy of the computation of sentiment polarity, as well as the correlation with
economic data. Nevertheless, identifying a smaller set of tweets would also reduce
the reliability of the correlation results.

Knowledge mapping by a functional data
analysis of scientific articles databases


Mappare la conoscenza attraverso un’analisi di dati 
funzionali di basi di dati di articoli scientifici 


Matilde Trevisani and Arjuna Tuzzi 


Abstract Scientometrics studies the evolution of science quantitatively, focusing
on the analysis of publications. One of its objectives is the development
of information systems that can help to explore the enormous amount of scientific
articles unceasingly published. Our study proposes an information system to reconstruct
a dynamical knowledge mapping from a functional data analysis perspective.
The source database is a diachronic corpus which gives rise to a words × time-points
contingency table displaying the frequencies of each keyword in the set of texts
grouped by time-points in the observed time span. The information system consists
of an information retrieval procedure for keyword selection and a two-stage functional
clustering to reconstruct the historical evolution of the knowledge field under
investigation.

Abstract La scientometria studia con un approccio quantitativo l’evoluzione della
scienza attraverso l’analisi delle pubblicazioni. Uno degli obiettivi è lo sviluppo di
sistemi di informazione di ausilio nell’esplorare l’enorme mole di articoli scientifici
pubblicati incessantemente. Il nostro studio propone un sistema di informazione
atto a ricostruire una mappatura dinamica della conoscenza secondo una prospettiva
di analisi di dati funzionali. Il database di partenza è un corpus diacronico
che dà origine a una tabella di contingenza parole × punti temporali contenente le
frequenze di ogni parola chiave nell’insieme dei testi raggruppati per punti temporali
lungo l’arco di tempo osservato. Il sistema informativo è costituito da una
procedura di recupero delle informazioni per la selezione delle parole chiave e un
clustering funzionale a due stadi per ricostruire l’evoluzione storica del campo di
conoscenza in esame.

1 Introduction


Scientometrics studies the evolution of science quantitatively, focusing on
the analysis of publications. One of its major objectives is the development of
information systems that can help to explore the enormous amount of scientific
articles unceasingly published. The two main methods for automatically designing
lexical maps are citation-based analysis and co-word analysis. Co-citation analysis 
maps the literature under consideration from the interaction of document citations 
whereas co-word analysis deals directly with the interaction of key terms shared by 
documents. Dynamical science mapping is another challenge that aims at describing 
dynamical patterns in science evolution. 

In our study a dynamical knowledge mapping is reconstructed from a functional 
data analysis (FDA) perspective. The source database is a diachronic corpus which 
is a collection of texts including information on the time period to which they relate. 
In bag-of-words approaches, a diachronic corpus gives rise to a words × time-points
contingency table displaying the frequencies of each keyword in the set of texts 
grouped by time-points in the observed time span. Diachronic corpora represent the 
ideal ground for studying the history of linguistic phenomena, e.g., when a corpus 
is able to reflect the relevant features of a text genre in a well-defined time period, 
the temporal evolution of word occurrences mirrors the historical development of 
the corresponding concepts [3]. 
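The words × time-points contingency table can be sketched as follows for a toy diachronic corpus. The texts, the keyword list and the whitespace tokenization are simplifying assumptions made for illustration only.

```python
from collections import Counter

def contingency_table(corpus, keywords):
    """corpus: list of (time_point, text) pairs. Returns the sorted
    time-points and, for each keyword, its frequency per time-point."""
    time_points = sorted({time_point for time_point, _ in corpus})
    counts = {time_point: Counter() for time_point in time_points}
    for time_point, text in corpus:
        counts[time_point].update(text.lower().split())
    table = {word: [counts[t][word] for t in time_points] for word in keywords}
    return time_points, table

# Toy corpus (assumed): titles tagged with their publication year.
corpus = [
    (1888, "statistics of wages"),
    (1888, "wages and prices"),
    (1950, "regression analysis"),
    (2012, "bayesian regression models"),
]
time_points, table = contingency_table(corpus, ["wages", "regression"])
```

Each row of the resulting table is the discrete trajectory of one keyword over time.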

This study proposes an information system consisting of (1) an information 
retrieval procedure for keywords’ selection and (2) a functional clustering two- 
stage approach to identify words showing prototypical temporal patterns and cluster 
words portraying similar temporal patterns. 

The procedure has been and is being applied to corpora of scientific papers pub- 
lished by leading journals of several disciplines, namely, statistics, social psychol- 
ogy, sociology and philosophy. This work connects to the project Tracing the His- 
tory of Words. A Portrait of a Discipline Through Analyses of Keyword Counts 
in Large Corpora of Scientific Literature (University of Padova, CPDA145940, 
2015-2017), involving an interdisciplinary research group whose aim is to construct 
chronological corpora, and, hence, to investigate whether a discipline history can 
be traced from analyzing the keywords’ temporal pattern. Several analyses are performed
to reconstruct a dynamical evolution: correspondence analysis, latent
Dirichlet allocation topic modelling, similarity analysis (using co-occurrences), and
curve clustering, which is the object of the present work.

2 Material: the corpus


Our databases are collections of articles published by a selection of premier journals
of the disciplines of interest over a long time period. The texts under consideration
consist of titles and/or abstracts and/or full texts of the scientific papers. Time is
typically discretized by years according to the cadence of volume publication.

As an example, consider one of the corpora analyzed for exploring the histor- 
ical evolution of Statistics. The database is the collection of papers published by 
the Journal of the American Statistical Association (JASA, 1922-) and its predeces- 
sors, Publications of the ASA (1888-1912) and Quarterly Publications of the ASA 
(1912-1921). Taking into account only the texts of titles including content words and
disregarding items that do not refer to research papers (e.g., List of publications, News,
Comment, Rejoinder), the corpus includes 10,077 titles of articles published in the 
period 1888-2012 (125 years, from Volume No. 1, Issue No. 1 to Volume No. 107, 
Issue No. 500, since at the very beginning the volumes were biennial). The corpus 
is composed of 87,060 word-tokens and 7,746 word-types. To solve the problem 
of identifying a set of keywords that prove relevant for the study of the history of 
Statistics, we adopt a stepwise procedure: 


1. to overcome some of the limitations of analyses based on simple word-types, we 
replace words with stems by means of the popular Porter’s stemming algorithm; 

2. to take into account compounds, multi-words and sequences of words which have 
different meanings when they are considered in their context of use and together 
with adjacent words, we identify n-stem-grams; 

3. to identify the most relevant statistical keywords, we match the vocabulary with
popular statistics glossaries available on-line: ISI-International Statistical Institute;
OECD-Organisation for Economic Cooperation and Development;
Statistics.com-Institute for Statistics Education; StatSoft Inc.;
University of California, Berkeley; University of Glasgow;

4. to reduce low frequency keywords, we select keywords with frequencies > 10.


The final contingency table includes the frequencies of 900 keywords over 107 time- 
points. 
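The stepwise selection above can be sketched as follows, under simplifying assumptions: titles are given as lists of already stemmed tokens (the Porter stemming step is not reproduced), the glossary is a plain set of stem-grams, and all names and toy values are illustrative.

```python
from collections import Counter

def ngrams(tokens, n_max):
    """Yield all sequences of 1 to n_max adjacent tokens, joined by spaces."""
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])

def select_keywords(titles, glossary, threshold, n_max=3):
    """Keep the n-stem-grams that appear in the glossary and whose
    frequency is strictly greater than `threshold`."""
    frequencies = Counter()
    for tokens in titles:
        frequencies.update(ngrams(tokens, n_max))
    return {gram: count for gram, count in frequencies.items()
            if gram in glossary and count > threshold}

# Toy input (assumed): stemmed titles and a three-entry glossary.
titles = [["linear", "regress"], ["linear", "regress", "model"], ["time", "seri"]]
glossary = {"linear regress", "time seri", "model"}
selected = select_keywords(titles, glossary, threshold=1)
```

With the corpus described above the threshold would be 10 rather than 1.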


3 Method: a functional clustering two-stage approach 


From an FDA perspective, discrete observations $y_i = \{y_{ij}\}$ of the frequency of a
keyword $i$ $(= 1,\dots,N)$ in the volumes $j = 1,\dots,T$ are viewed as a realization of
an underlying continuous function $x_i(t)$. As $y_i$ is a noisy observation of $x_i(t)$, an
adequate model is $y_i = x_i(\mathbf{t}) + \epsilon_i$, where $\mathbf{t} = \{t_j\}$ is the
finite set of time-points and $\epsilon_i = \{\epsilon_{ij}\}$ is a zero mean vector with
dispersion matrix $\Sigma_\epsilon$.

For representing functional data (FD) as smooth functions, one method is the basis
function approach, where $x_i(t)$ is represented by a finite-dimensional linear
combination $x_i(t) = \sum_{k=1}^{K} c_{ik}\phi_k(t)$, $c_{ik} \in \mathbb{R}$, for
sufficiently large $K$, of real-valued functions $\phi_k$ called basis functions. In this
study we consider B-splines, as they constitute a very flexible basis for non-periodic FD.
As regards the positioning of breakpoints, a direct and reasonable choice is placing knots
at each time-point $t_j$.

We adopt the roughness penalty approach for estimation, under which the estimate
$\hat{x}_i$ is the function minimizing the penalized residual sum of squares,
$\mathrm{PENSSE} = \mathrm{SSE} + \lambda \cdot \mathrm{PEN}_r$, where
$\mathrm{SSE} = \{y_i - x_i(\mathbf{t})\}' W \{y_i - x_i(\mathbf{t})\}$ (with
$W = \Sigma_\epsilon^{-1}$) is the residual sum of squares,
$\mathrm{PEN}_r = \int [D^r x_i(s)]^2 ds$ is the penalty term and $\lambda$ is a smoothing
parameter.

A standard practice for choosing $\lambda$ is to use the generalized cross validation,
$\mathrm{GCV}(\lambda) = \frac{T}{(T - df(\lambda))^2}\,\mathrm{SSE}(\hat{x}_i)$, which
provides a convenient approximation to leave-one-out CV. Here $df(\lambda)$ is the
effective degrees of freedom, which is monotone decreasing in $\lambda$ with maximum equal
to $K$ when $\lambda = 0$.
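As a small illustration of this criterion, the sketch below evaluates GCV as T times SSE divided by the squared difference between T and the effective degrees of freedom, and picks the candidate smoothing parameter with the lowest value. The candidate triples (smoothing parameter, SSE, degrees of freedom) are invented numbers; in practice they would come from fitting the penalized spline for each candidate.

```python
def gcv(sse, df, n_points):
    """Generalized cross validation score for one fitted smooth."""
    return n_points * sse / (n_points - df) ** 2

def best_smoothing_parameter(candidates, n_points):
    """candidates: list of (lam, sse, df) triples, one per fitted smooth.
    Returns the lam with the lowest GCV score."""
    return min(candidates, key=lambda c: gcv(c[1], c[2], n_points))[0]

# Toy fits (assumed values) for a curve observed at T = 107 time-points.
fits = [(0.1, 50.0, 40.0), (10.0, 60.0, 8.0), (1000.0, 120.0, 3.0)]
chosen = best_smoothing_parameter(fits, n_points=107)
```

The toy values show the usual trade-off: a small parameter gives a low SSE but many degrees of freedom, a large one the opposite, and GCV favours the intermediate fit.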

We smooth the data by trying different spline orders combined with various
roughness penalties and varying the smoothing parameter over a suitable range
of values.

We adopted a distance-based approach, in particular the k-means algorithm combined
with the $L_2$ metric to measure distance between curves. Besides the $L_2$ metric,
other measures of proximity can be considered, such as the $L_1$ metric, the adaptive
dissimilarity index, and the correlation-based dissimilarity [2].
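A bare-bones version of this clustering step is sketched below. It assumes that the smoothed curves have been evaluated on a common grid, so that the distance between two curves is approximated by the Euclidean distance between their vectors of values, and it takes the initial centers as an argument instead of using a seeding method. The curves are toy values.

```python
def l2_distance(a, b):
    """Euclidean distance between two curves sampled on the same grid."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_means(curves, centers, n_iter=100):
    """Plain Lloyd iterations from the given initial centers.
    Returns the cluster label of each curve and the final centers."""
    for _ in range(n_iter):
        labels = [
            min(range(len(centers)), key=lambda k: l2_distance(curve, centers[k]))
            for curve in curves
        ]
        new_centers = []
        for k in range(len(centers)):
            members = [c for c, label in zip(curves, labels) if label == k]
            if members:
                new_centers.append([sum(v) / len(members) for v in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Toy curves (assumed): two rising and two falling keyword trajectories.
curves = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [3.0, 2.0, 1.0], [2.9, 2.1, 1.1]]
labels, centers = k_means(curves, centers=[curves[0], curves[2]])
```

Repeated runs from different initial centers, keeping the partition with the lowest within-cluster sum of squares, would make the result less dependent on the starting point.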

Cluster validation is an essential step in the cluster analysis process. Within the 
approaches to cluster validation [1], the use of external information is a valuable and 
ultimately necessary tool. Here, external information consists of an informal assess- 
ment of subject matter experts. On the other side, a large number of indexes has been 
proposed in the literature for a validation based on the clustered data alone. In this 
study we combine a large number of internal validation indexes without integrating 
subject matter knowledge, so as to let the data bring out the best rated groupings. 
Our clustering procedure is thought of as a tool of thorough investigation before 
submitting the results to experts who possibly will guide towards other analyses. 


4 Theory: corpus data transformation 


The decision about what data to use is an important part of the clustering process, 
and often has a fundamental impact on the resulting clusters. 

If we consider the keywords × time-points table by row, a typical feature of a
word trajectory is a sharp peak-and-valley trend, mainly due to the sparsity affecting
frequency data of a corpus. On the other hand, if we look at data by column they
appear strongly asymmetrical, in particular because of the marked disparity of frequency
classes between the most popular words and all of the others. This is a typical feature
of word-type frequency distributions, also known as the large number of rare events
(LNRE) property. Lastly, the size of time-point subcorpora may vary greatly over time.

In our research, we envision several transformations which address two different
objectives: whether, in assessing two curves as similar, we should consider height
(word popularity) and timing (synchrony) jointly, or timing only. In the first case
we just need to normalize data by column; in the other case we need to normalize
by row or, better still, since a sort of column-normalization should be regarded as
preliminary, to resort to some double normalization.

The normalization step (Table 1) of our procedure provides several transformations
by column (c1-c5) and by row (r1-r5).


Table 1 Normalization plan

                            normalized by column
                            (corpus logic)          ("table" logic)         (LNRE)
                            subcorpus   subcorpus   column      column      dynamic
normalized by row           #titles     #tokens     sum (√·)    max. freq.  density
row sum                     d           d           d1 (χ²)     d           d1b       r1
z-score by row              d           d2          d           d           d         r2
maximum row frequency       d           d3          d           d           d         r3
nonlinear transformation    d           d4          d           d           d         r4
nonlinear transformation    d           d4b         d           d           d         r4b
relative difference         d           d5          d           d           d         r5
                            c1          c2          c3          c4          c5


Crossing a column- by a row-normalization generates a double normalization. 
Our comprehensive study examines all the transformations specifically indicated in 
the table. Here we present a small subset: c2, d1, d3. 
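
The following minimal sketch illustrates how a double normalization can be built by crossing a column step with a row step. It assumes that c2 divides each column by the number of tokens of its subcorpus, that r1 divides each row by its sum and that r3 divides each row by its maximum; the function names, the toy counts and the subcorpus sizes are hypothetical and are not taken from the study.

```python
def normalize_by_column(table, column_sizes):
    """Column step (assumed c2): divide each count by its subcorpus size."""
    return [[x / s for x, s in zip(row, column_sizes)] for row in table]


def normalize_by_row_sum(table):
    """Row step (assumed r1): divide each row by its sum; all-zero rows stay zero."""
    out = []
    for row in table:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def normalize_by_row_max(table):
    """Row step (assumed r3): divide each row by its maximum; all-zero rows stay zero."""
    out = []
    for row in table:
        peak = max(row)
        out.append([x / peak if peak else 0.0 for x in row])
    return out


# Toy keywords x time-points table of raw frequencies (hypothetical data).
counts = [[10, 20, 30],
          [1, 0, 3]]
tokens = [100, 200, 300]  # hypothetical subcorpus sizes, in tokens

c2 = normalize_by_column(counts, tokens)  # keeps word popularity
d1 = normalize_by_row_sum(c2)             # double normalization: column, then row sum
d3 = normalize_by_row_max(c2)             # double normalization: column, then row maximum
```

After the row step, both toy words are on a common scale, so only the timing of their trajectories remains comparable, which is the trade-off described above.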


5 Results and conclusions 


Optimal smoothing for c2-normalized data is achieved with m = 5 and λ = 10^0 
(df = 7.7) after setting a PEN roughness penalty, whereas for both d1- and d3-normalized 
data the criterion led to m = 3 and λ = 10^1.75 (df = 7.375) under a PEN 
roughness penalty. Curves are then partitioned by the k-means algorithm on the basis 
of the Euclidean distance. The algorithm is re-run, for each k from 2 to 26, 20 times 
from different initial configurations set through the k-means++ seeding method. 
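
The restart scheme can be sketched as follows, using only the Python standard library. This is an illustrative implementation of Lloyd's algorithm with k-means++ seeding under the assumption that each smoothed curve is represented as a vector of values; the function names and the toy data are hypothetical, and the code is not the software used in the study.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
            for p in points]


def kmeanspp_init(points, k, rng):
    """k-means++ seeding: after a uniformly drawn first centre, each further
    centre is drawn with probability proportional to its squared distance
    from the nearest centre already chosen (needs at least k distinct points)."""
    centres = [list(rng.choice(points))]
    while len(centres) < k:
        weights = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centres


def kmeans(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    centres = kmeanspp_init(points, k, rng)
    for _ in range(max_iter):
        labels = assign(points, centres)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centre.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centres[j])
        if updated == centres:
            break
        centres = updated
    labels = assign(points, centres)
    wss = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, wss


def best_of_restarts(points, k, n_init=20, seed=0):
    """Re-run k-means n_init times and keep the run with the smallest criterion."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(n_init)),
               key=lambda run: run[1])


# Toy "curves" evaluated at two time points (hypothetical data).
curves = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
partitions = {k: best_of_restarts(curves, k) for k in range(2, 4)}
labels, wss = partitions[2]
```

In the setting described above, the dictionary comprehension would run over k from 2 to 26 rather than over the two values used for this toy example.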

A set of 49 quality criteria is then computed in order to identify the best partition 
and number(s) of clusters. By pooling rankings from all the quality indices, the 
frequency of being in the top-1 up to the top-4 is calculated for each cluster number 
k. In general, partitions into two or three clusters are the best rated. This reflects the 
substantial bifurcation of the historical period around the sixties, at which the birth 
of Statistics as an autonomous and established discipline can be placed. Moreover, 
partitions with a number of clusters close to the maximum of the considered range 
have also been frequently selected. This result may be an artifact of the standard 
assumption of normally distributed data. From the foregoing, once the solutions 
picking the extremes are discarded, the most selected cluster numbers are: 5 for c2, 6 for 
d1 and 5 for d3. To compare some aspects of how the three transformations affect 
clustering, we consider the best partition found with the above numbers of clusters 
(conclusions are below). 
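
The pooling of rankings can be illustrated with a short sketch. It assumes that every quality index assigns a score to each candidate number of clusters and that higher scores are better (an index for which lower is better would need its sign flipped first); the index names and scores below are hypothetical.

```python
from collections import Counter


def top_m_frequency(scores_by_index, m):
    """For each k, count how many indices rank it among their m best values."""
    freq = Counter()
    for scores in scores_by_index.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        freq.update(ranked[:m])
    return freq


# Hypothetical scores of three quality indices for k = 2, ..., 5.
scores_by_index = {
    "index_a": {2: 0.71, 3: 0.64, 4: 0.50, 5: 0.58},
    "index_b": {2: 0.40, 3: 0.45, 4: 0.30, 5: 0.44},
    "index_c": {2: 0.90, 3: 0.20, 4: 0.10, 5: 0.85},
}

top1 = top_m_frequency(scores_by_index, 1)  # how often each k is ranked first
top2 = top_m_frequency(scores_by_index, 2)  # how often each k is in the best two
```

Computing the same counter for m = 1, ..., 4 gives the top-1 to top-4 frequencies on which the selection described above is based.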

Let us now examine some aspects of clustering, in the three cases of normalization, 
by varying the number of clusters (Table 2): how well balanced the groups are; how 
many groups are singletons; how heterogeneous the groups are in being composed 
of words of different frequency class or popularity. 


Table 2 Balance, presence of singletons and heterogeneity of frequency classes 


                                              number of clusters
               normalization     2    3    4    5    6    7    8    9   10   15   20   25

Quality of     c2              .00  .12  .26  .29  .44  .49  .56  .59  .63  .80  .86  .91
balancing      d1              .72  .93  .90  .90  .94  .96  .95  .97  .97  .98  .98  .99
               d3              .84  .88  .92  .93  .93  .95  .95  .96  .97  .97  .98  .99

Number of      c2              Da Dee 2 3. 38. Be 10
singletons     d1                0    0    0    0    0    0    0    0    0    1    1    1
               d3              0 0 0 0 1 0 0 1 0 3 5

Heterogeneity  c2                1  .50  .06  .09  .02  .02  .05  .09  .09  .11  .05  .12
of frequency   d1                1    1    1  .99  .99  .99  .98  .98  .97  .96  .95  .94
classes        d3              .90  .95  .95  .93  .81  .85  .80  .82  .80  .80  .78  .77


A summary of conclusions follows. 


1. Normalization by column maintains the level of word popularity differentiated 
and produces a dominance of high frequency words on the clustering results. 
Significant imbalance in cluster size, large presence of singletons, lack of hetero- 
geneity of frequency classes in group composition and, finally, the presence of 
one or more “amorphous” groups, made up almost exclusively of low frequency 
words, are some of the most evident effects of this type of transformation. 

2. Conversely, the double normalization produces groups normally well balanced 
both in cluster size and frequency classes, rare singletons, and almost never amor- 
phous groups, but does lose the information on word popularity. 

3. Specifically, type-d1 normalization is better able to recognize any group of words 
having “sparse” trajectories, i.e., which have experienced birth and/or death over 
the period considered, while the d3 variant, which more properly “normalizes” 
the frequency, builds the groups primarily looking at the curve shape, i.e., at whether 
the relative “popularity” of a word has been constant over time or has fluctuated 
(and how) during its life cycle. 
Characterizing the extent of rater agreement via 
a non-parametric benchmarking procedure 


Characterizing the degree of intra/inter-rater agreement 
through a non-parametric benchmarking procedure 


Amalia Vanacore and Maria Sole Pellegrino 


Abstract In several contexts ranging from medical to social sciences, rater 
reliability is assessed in terms of intra (-inter) rater agreement. The extent of rater 
agreement is commonly characterized by comparing the value of the adopted 
agreement coefficient against a benchmark scale. This deterministic approach has 
been widely criticized since it neglects the influence of experimental conditions on 
the estimated agreement coefficient. In order to overcome this criticism, in this 
paper a statistical procedure for benchmarking is presented. The proposed procedure 
is based on non-parametric bootstrap confidence intervals. The statistical properties 
of the proposed procedure have been studied via a Monte Carlo simulation. 

Abstract In many application contexts, from medicine to the social sciences, the 
reliability of a rater is assessed in terms of the degree of intra (-inter) rater agreement. 
The degree of agreement is typically characterized by comparing the estimate of the 
adopted agreement coefficient with a reference (benchmark) scale. 
This “deterministic” approach has often been criticized in the literature because it 
does not take into account the influence of experimental conditions on the estimation 
process. This paper presents a benchmarking procedure based on bootstrap 
confidence intervals. The statistical properties of the proposed procedure have been 
studied via Monte Carlo simulation. 


Key words: rater reliability, kappa-type agreement index, statistical power, Monte 
Carlo simulation 
1. Introduction 


In many research contexts (e.g., cognitive and behavioural science, quality 
science, clinical epidemiology, diagnostic imaging, content analysis), there is 
frequently a need to assess the performance of human instruments (i.e., raters) 
providing subjective measurements, expressed on a dichotomous, nominal or ordinal 
rating scale. Rater reliability is often evaluated in terms of the extent of agreement 
between two or more series of ratings provided by two or more raters (inter-rater 
agreement) or by the same rater on two or more occasions (intra-rater agreement). 
Specifically, inter-rater agreement concerns the reproducibility of 
measurements provided by different raters, whereas intra-rater agreement 
concerns self-reproducibility (also known as repeatability). 

The easiest way of measuring agreement between ratings is to calculate the 
overall percentage of agreement; nevertheless, this measure does not take into 
account the agreement that would be expected by chance alone [11]. A reasonable 
alternative is to adopt the widespread kappa-type index that was introduced by 
Cohen in 1960 as a rescaled measure of the probability of observed agreement 
corrected with the probability of agreement expected by chance alone. A main issue 
for the correct definition of a kappa-type index regards the notion of expected 
proportion of agreement: chance measurements are conceived as blind (that is, 
uninformative about the rated items) and any distributional assumption for them is 
likely to be arbitrary. A solution is to adopt the notion of uniform chance 
measurement [2] that — given a certain rating scale — can be assumed as a 
reasonable model for the maximally non-informative measurement system. This 
uniform version of Kappa is often referred to as Brennan-Prediger coefficient [3]. 

The extent of a kappa-type index is generally qualified through a benchmark 
scale [e.g. 1, 8, 10], i.e., a set of threshold values against which the estimated 
agreement coefficient is compared to decide whether the extent of agreement is good or poor. 
Although commonly adopted, this deterministic benchmarking approach does not 
consider that the value of the information provided by an agreement coefficient is 
unknown since, being computed on a sample of items, its estimate is subject to 
sampling error. In order to identify a suitable neighbourhood of the truth (i.e., the 
true population value), sampling error has always to be considered. 

In this paper a benchmarking procedure based on bootstrap resampling is 
proposed in order to take into account the sampling uncertainty when characterizing 
the extent of rater agreement. The main statistical properties of the proposed 
procedure have been assessed via a Monte Carlo simulation study. 

The remainder of this paper is organized as follows: in Section 2 the weighted 
Brennan-Prediger coefficient is introduced; Section 3 is devoted to coefficient 
estimation and inference; in Section 4 the simulation design is described and the 
main results are discussed; finally, conclusions are summarized in Section 5. 
2. Weighted Brennan-Prediger Coefficient 


Let $n$ be the number of items rated twice (i.e., two replications) on an ordinal 
$k$-point rating scale (with $k > 2$), $n_{ij}$ the number of items classified into the $i$-th 
category in the first replication and into the $j$-th category in the second replication, 
and $w_{ij}$ the corresponding symmetric weight (i.e., $w_{ij} = w_{ji}$), introduced in order to 
account for the fact that on an ordinal rating scale some disagreements are more serious 
than others (i.e., disagreements on two distant categories are more relevant than 
disagreements on neighbouring categories). 

The weighted Brennan-Prediger coefficient [9] is defined as: 

$$\hat{K}_w = \frac{\hat{p}_a - p_{a|c}}{1 - p_{a|c}}$$

where $\hat{p}_a = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,(n_{ij}/n)$ and $p_{a|c} = \frac{1}{k^2}\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}$. 
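
As a worked illustration of the definition above, the following sketch computes the weighted Brennan-Prediger coefficient from a k × k table of counts, with the chance term taken as the sum of all weights divided by k². The linear weights w_ij = 1 − |i − j|/(k − 1) are an assumption made here (the paper mentions a linear weighting scheme only for its simulation study), and the function names and the toy table are hypothetical.

```python
def linear_weights(k):
    """Linear agreement weights: 1 on the diagonal, 0 for the most distant categories."""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def weighted_bp(table, weights):
    """Weighted Brennan-Prediger coefficient from a k x k table of counts n_ij."""
    k = len(table)
    n = sum(map(sum, table))
    # Weighted observed agreement: sum of w_ij * n_ij / n.
    p_a = sum(weights[i][j] * table[i][j] for i in range(k) for j in range(k)) / n
    # Agreement expected under uniform chance: sum of all weights divided by k^2.
    p_c = sum(map(sum, weights)) / k ** 2
    return (p_a - p_c) / (1 - p_c)


w = linear_weights(3)

# Hypothetical 3 x 3 table: rows = first replication, columns = second replication.
ratings = [[4, 1, 0],
           [1, 4, 1],
           [0, 1, 3]]
perfect = [[5, 0, 0],
           [0, 5, 0],
           [0, 0, 5]]

k_w = weighted_bp(ratings, w)
k_perfect = weighted_bp(perfect, w)
```

For the toy table, the weighted observed agreement is 13/15 and the chance term is 5/9, so the coefficient equals (13/15 − 5/9)/(4/9) = 0.7.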


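The $\hat{K}_w$ coefficient ranges from -1 to +1 and it can be assumed to be asymptotically 
normally distributed [9] with mean $K_w$ and variance $\sigma^2_{\hat{K}_w}$, given by: 

$$\hat{\sigma}^2_{\hat{K}_w} = \frac{1}{n(n-1)} \sum_{h=1}^{n} \frac{a_h^2}{(1 - p_{a|c})^2}$$

where $h$ refers to the generic rated item and $a_h = \sum_{i,j=1}^{k} w_{ij}\,(\delta_{ij}^{(h)} - p_{ij})$, with 
$p_{ij} = n_{ij}/n$ and $\delta_{ij}^{(h)} = 1$ if the rater rated item $h$ into the $i$-th and $j$-th category in 
the first and second replication, respectively, and $\delta_{ij}^{(h)} = 0$ otherwise. 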


3. Characterization of rater agreement 


The approach currently adopted to characterize the extent of agreement is based 
upon a straight comparison between the estimated coefficient and an adopted 
benchmark scale. The most widespread benchmark scale for interpreting the 
magnitude of agreement coefficients was proposed by Landis and Koch [10]. 
According to this scale, there are 5 categories of agreement corresponding to as many 
ranges of coefficient values: slight, fair, moderate, substantial and almost perfect 
agreement for coefficient values ranging between 0 and 0.2, 0.21 and 0.4, 0.41 and 
0.6, 0.61 and 0.8 and 0.81 and 1.0, respectively. 
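
The deterministic use of the scale amounts to a simple lookup, sketched below. The function name is hypothetical; the label "poor" for negative values comes from the original Landis and Koch scale and is not among the five categories listed above.

```python
# Upper limits of the Landis and Koch categories listed in the text.
SCALE = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def landis_koch(value):
    """Map an agreement coefficient to its Landis and Koch category."""
    if value < 0:
        return "poor"  # negative values: category of the original scale
    for upper, label in SCALE:
        if value <= upper:
            return label
    raise ValueError("an agreement coefficient cannot exceed 1")


category = landis_koch(0.70)
```

Applying this lookup directly to a point estimate is exactly the deterministic approach: it returns the same category whether the estimate comes from 20 items or from 2000.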

Although benchmark scales are widely adopted for relating the magnitude of the 
coefficient to the notion of extent of agreement, some researchers question their 
validity and warn that their uncritical application may lead to practically 
questionable decisions [11]. Actually, as argued in [9] the choice of the benchmark 
scale is less important than the way it is used for characterizing the extent of 
agreement. 
A deterministic approach to benchmarking does not account for the influence of 
experimental conditions on the estimated coefficient and, thus, it does not allow for a 
statistical characterization of the extent of rater agreement. This criticism may be 
overcome by benchmarking the lower bound of the confidence interval of the 
agreement coefficient rather than its point estimate. 

Assuming the asymptotic normal approximation, the lower and upper bounds of a 
two-sided $(1-\alpha)\%$ confidence interval for $K_w$ are given by: 

$$\hat{K}_w \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{K}_w}$$

The accuracy of the above confidence interval depends on the asymptotic 
normality of $\hat{K}_w$ and on the asymptotic solution for $\sigma_{\hat{K}_w}$, which are both questionable 
for small sample sizes. 

Resampling, which is generally considered the approach of choice when the 
assumptions of classical statistical methods are not met, may yield more accurate 
confidence limits and thus it can be usefully adopted to characterize the extent of rater 
agreement. 

Among the available methods to build bootstrap confidence intervals, the 
percentile bootstrap (hereafter, p) is the simplest and the most popular one. The lower 
and upper bounds of the $(1-\alpha)\%$ two-sided p confidence interval are, respectively, 
the $(\alpha/2)$ and $(1-\alpha/2)$ percentiles of the cumulative distribution function $G$ of the 
bootstrap replications of $\hat{K}_w$. On the other hand, the Bias-Corrected and Accelerated 
bootstrap (hereafter, BCa) confidence interval is recommended for severely non-normal 
data [4, 6]. Despite the high computational complexity needed, BCa 
confidence intervals have generally smaller coverage errors than the others. The lower 
and upper bounds of the $(1-\alpha)\%$ two-sided BCa confidence interval are defined as: 

$$G^{-1}\!\left(\Phi\!\left(b + \frac{\pm z_{\alpha/2} + b}{1 + a\,(\mp z_{\alpha/2} - b)}\right)\right) \qquad (4)$$

being $\Phi$ the cumulative distribution function of the standard normal distribution, $z_{\alpha/2}$ 
the $\alpha/2$ percentile of the standard normal distribution, $b$ the bias correction parameter 
and $a$ the acceleration parameter; the upper signs give the lower bound and the lower 
signs give the upper bound. 
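
A minimal sketch of benchmarking with the percentile bootstrap is given below. It assumes linear weights, resamples the rated items with replacement and uses simple order statistics as empirical percentiles; the BCa variant is not implemented. Function names, the toy ratings and the 0.6 threshold chosen for the comparison are illustrative assumptions.

```python
import random


def bp_linear(pairs, k):
    """Weighted Brennan-Prediger coefficient with linear weights, computed from
    (first rating, second rating) pairs coded 0, ..., k-1."""
    def weight(i, j):
        return 1 - abs(i - j) / (k - 1)

    p_a = sum(weight(i, j) for i, j in pairs) / len(pairs)
    p_c = sum(weight(i, j) for i in range(k) for j in range(k)) / k ** 2
    return (p_a - p_c) / (1 - p_c)


def percentile_interval(pairs, k, alpha=0.05, n_boot=1500, seed=0):
    """Two-sided percentile bootstrap interval obtained by resampling items."""
    rng = random.Random(seed)
    n = len(pairs)
    reps = sorted(bp_linear(rng.choices(pairs, k=n), k) for _ in range(n_boot))
    lower = reps[int(n_boot * alpha / 2)]
    upper = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical data: 30 items rated twice on a 4-point scale.
pairs = ([(0, 0)] * 8 + [(1, 1)] * 7 + [(2, 2)] * 6 + [(3, 3)] * 5
         + [(0, 1)] * 2 + [(2, 3)] * 2)

estimate = bp_linear(pairs, 4)
lower, upper = percentile_interval(pairs, 4)

# Benchmark the lower bound, not the point estimate, against a threshold.
at_least_substantial = lower > 0.6
```

The comparison on the last line is the statistical counterpart of the deterministic lookup: agreement is declared at least substantial only if the whole interval lies above the lower limit of that category.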


4. Simulation study 


In order to analyse the statistical properties of the proposed benchmarking 
procedure in terms of Type I error and statistical power, a Monte Carlo simulation 
study has been developed considering one rater who classifies $n$ items into one of $k$ 
possible ordinal rating categories. The simulation has been conducted by sampling 
$r = 2000$ Monte Carlo data sets from a multinomial distribution with parameters $n$ 
and $p = (\pi_{11}, \ldots, \pi_{ij}, \ldots, \pi_{kk})$; the $\pi_{ij}$ values have been chosen so as to obtain ten true 
population values of $K_w$ (viz., 0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1.0), 
assuming a linear weighting scheme [4]. The BCa confidence interval for each $\hat{K}_w$ 
has been built on 1500 bootstrap replications. The statistical properties of the 
benchmarking procedure have been studied for a $k = 4$ point rating scale and for 
$n = 20, 30, 40, 50$, which are the most affordable sample sizes in many experimental 
contexts and also the most critical ones for statistical inference. 

Simulation results in terms of statistical significance and power are reported in 
Table 1, organized in four distinct sections each corresponding to a null hypothesis of 
rater agreement, which is tested against several specific alternative hypotheses. 


Table 1: Statistical significance (first row of each block; in bold in the original) and power for different true population values of K_w. 


                                n = 20          n = 30          n = 40          n = 50
Null hypothesis   True K_w      p      BCa      p      BCa      p      BCa      p      BCa

K_w = 0.00        0.00        0.046  0.046    0.038  0.029    0.028  0.027    0.030  0.027
                  0.50        0.645  0.622    0.813  0.768    0.887  0.878    0.950  0.940
                  0.60        0.870  0.852    0.972  0.956    0.991  0.991    0.998  0.997
                  0.70        0.958  0.946    0.994  0.992    1.000  0.999    1.000  1.000
                  0.80        0.997  0.996    0.999  0.999    1.000  1.000    1.000  1.000
                  0.90        1.000  1.000    1.000  1.000    1.000  1.000    1.000  1.000

K_w < 0.40        0.40        0.043  0.034    0.045  0.034    0.039  0.032    0.033  0.023
                  0.50        0.091  0.087    0.124  0.117    0.131  0.123    0.156  0.132
                  0.60        0.242  0.233    0.358  0.344    0.379  0.357    0.468  0.407
                  0.70        0.460  0.427    0.636  0.612    0.721  0.705    0.828  0.781
                  0.80        0.774  0.749    0.915  0.906    0.966  0.958    0.987  0.987
                  0.90        0.958  0.933    0.993  0.993    0.999  0.999    1.000  1.000

K_w < 0.60        0.60        0.058  0.055    0.045  0.043    0.043  0.037    0.033  0.032
                  0.70        0.172  0.166    0.184  0.180    0.221  0.189    0.243  0.218
                  0.80        0.407  0.393    0.484  0.460    0.573  0.533    0.648  0.648
                  0.85        0.694  0.681    0.823  0.806    0.914  0.890    0.953  0.953
                  0.90        0.747  0.735    0.870  0.854    0.941  0.916    0.974  0.969
                  0.95        0.932  0.936    0.979  0.980    0.995  0.988    1.000  1.000

K_w < 0.80        0.80        0.140  0.125    0.069  0.078    0.064  0.060    0.061  0.055
                  0.85        0.329  0.346    0.246  0.254    0.312  0.241    0.291  0.253
                  0.90        0.425  0.452    0.380  0.363    0.451  0.394    0.452  0.405
                  0.95        0.716  0.747    0.714  0.682    0.801  0.799    0.834  0.761
                  1.00        1.000  1.000    1.000  1.000    1.000  1.000    1.000  1.000
As foreseen, the statistical properties of the proposed benchmarking procedure 
improve as the sample size increases, being satisfactory even for relatively small 
sample sizes. Specifically, the significance level is generally slightly better for the BCa 
bootstrap confidence interval; it decreases with increasing sample size but grows 
with increasing true population value of $K_w$; it is close to the nominal level 
($\alpha = 0.025$) only in the case of null rater agreement for $n \geq 40$; however, it is 
always less than 0.10 except for $n = 20$ when testing a high rater agreement level. 
The statistical power, instead, is generally slightly better for the p bootstrap confidence 
interval; for $n \geq 30$, it is less than 80% only in testing hypotheses referring to 
adjacent agreement categories (e.g., poor vs slight, moderate vs substantial). 


5. Conclusions 


The proposed benchmarking procedure can be suitably applied for the 
characterization of the extent of agreement over a small or moderate number of 
subjective ratings provided by one or more raters. The procedure shows satisfactory 
statistical properties in testing both null and non-null cases of rater agreement, being 
adequately powered in detecting differences in the extent of rater agreement that are 
of practical interest for agreement studies (i.e., differences of at least 0.2). 

It is worthwhile to note that the proposed benchmarking procedure can be also 
adopted to characterize the extent of inter-rater agreement which, in the case of more 
than two raters, could be estimated using a suitable variant of kappa coefficient, 
such as Fleiss’ kappa. 
Mining Mobile Phone Data to Detect Urban 
Areas 


Analysis of mobile phone data for the detection of 
urban areas 


Maarten Vanhoof, Stephanie Combes and Marie-Pierre de Bellefon 


Abstract The production of Urban Areas zonings at national level is characterized 
by long delays between consecutive updates. As mobile phone data has recently 
shown promising results for automated land use classification, we investigate the 
possibility to reproduce the French Urban Area Zoning (ZAUER). We exploit a 
dataset of hourly mobile phone activity profiles at cell-tower level, discuss method- 
ological challenges, and find Urban Centers to be most correctly classified. Our 
findings frame the possibilities and limits for using mobile phone data to automatically 
and continuously produce urban zonings. 


Abstract In this paper we examine the ability of mobile phone data to 
reproduce the French urban area zoning. Starting from a dataset of hourly 
phone activity profiles recorded by the antennas of one of the major French telephone 
operators, we analyse the methodological challenges involved and identify 
the urban centers that are most easily predicted. The results show some of the 
possibilities and limits of using mobile phone data to produce 
zonings automatically and continuously. 


Key words: Supervised classification, Mobile phone data, Spatial autocorrelation, 
Urban areas, Map comparison 


Maarten Vanhoof 
Open Lab, Newcastle University, Newcastle-Upon-Tyne, UK and Orange Labs, Paris, FR 
e-mail: M.Vanhoof1@newcastle.ac.uk 


Stephanie Combes 
INSEE, 18 boulevard Adolphe PINARD, Paris, France 
e-mail: stephanie.combes@gmail.com 


Marie-Pierre de Bellefon 
INSEE, 18 boulevard Adolphe PINARD Paris, France 
e-mail: marie-pierre.de-bellefon@insee.fr 




Alessandra Petrucci, Rosanna Verde (edited by), SIS 2017. Statistics and Data Science: new challenges, new generations. 
28-30 June 2017 Florence (Italy). Proceedings of the Conference of the Italian Statistical Society 
ISBN (online) 978-88-6453-521-0 (online), CC BY 4.0, 2017 Firenze University Press 




1 Introduction 


The growth of cities, and with it the extension of urban agglomerations, has become 
characteristic of contemporary times [Galster et al., 2001]. In this context, the 
identification of economically integrated areas is crucial for the adequate implementation 
of policy measures [Duranton and Puga, 2014] and thus calls for the definition 
of typologies other than purely administrative regions. As a means of defining 
integrated (urban) areas, [Berry et al., 1969] suggested relying on commuting patterns 
toward a predefined urban core to delineate metropolitan areas. In the same spirit, the 
National Statistics Office of France (INSEE) nowadays produces a zoning (ZAUER: 
Urban Area and Rural Employment Areas Zoning) that identifies cities' areas of 
influence at a national level. Producing urban area zonings is a complex task involving 
multiple actors and methods. As a consequence, long delays are observed between 
consecutive publications (between five and ten years in France), which contrast with 
the fast pace of change in territories [Floch and Levy, 2011]. In this context, alternative 
sources of more timely, high-resolution data associated with simpler procedures 
could contribute in a meaningful way to the production of more recurrent releases 
of such typologies. 


In this paper we investigate the capabilities of French mobile phone data to 
reproduce the ZAUER zoning and explore a procedure that could lead towards a 
data-driven and recurrent production of the typology between official releases. Our 
contribution is twofold. First, we demonstrate how mobile phone data can be mobilized 
to develop a nationwide typology of urban areas. Secondly, we elaborate a case that 
demonstrates how supervised classification tools can be of interest to official 
statistics. In addition to their ease of use, supervised classification techniques provide 
both classification outputs and a quality evaluation. The latter being key in official 
statistics, we deem that a wider investment in these techniques could be profitable. 


2 Urban Areas in France 


The official French Urban Areas classification (ZAUER), as produced by the French 
National Statistical Institute (INSEE), consists of 9 classes, distinguished 
mainly by the size of the employment pole in the 'central' Urban Unit. Major, 
medium and small centers are Urban Units offering respectively more than 10,000, 
between 5,000 and 10,000 and between 1,500 and 5,000 jobs, that are inhabited by at 
least 2,000 people and cover a continuous built-up area with no more than 200 meters 
between buildings. Surroundings of urban centers are municipalities for which 
more than 40 % of the working population commute daily to an adjacent urban 
center. Special cases are recognized for municipalities that have several urban 
centers to commute to (multi-polarized municipalities) or municipalities that are not 
'influenced' by any urban center. 
Fig. 1 Spatial distribution of ZAUER classes. The 2010 ZAUER classification is founded on data 
collected between 2008 and 2010 in the national census survey. 


3 Activity profiles from Mobile Phone Data 


Call Detail Record (CDR) data are collected by mobile phone service providers for 
billing and network maintenance purposes. CDR data gather locational, temporal, 
and interactional information (who contacts whom) every time a phone call or text 
message is initiated by a user. The spatial resolution of observations is restricted 
to the locations of cell-towers, which are not uniform in space because of demand- 
driven placement. 


In this study we use an aggregated CDR dataset from France provided by Orange 
and based on the activities of 18 million subscribers during a period of 154 consecutive 
days in 2007 (May 13 to October 14). Anonymisation at the individual level was 
complemented with aggregation at the cell-tower level (see next paragraph), thereby 
ensuring full privacy of individual users as demanded by the French Data Protection 
Agency (CNIL) in light of the EU General Data Protection Regulation. 


Literature validates the hypothesis that cell-tower profile activity (amounts of 
events measured at an antenna over time) can be informative for territory qualifi- 
cation [Soto and Martinez, 2011]. Therefore, we construct antenna activity profiles 
as a time series of the amount of activities registered each day at every hour and 
standardize the series for comparative purposes. Next, the obtained relative hourly 
profiles are averaged per hour of the day for the entire six-month window, resulting 
in activity profiles for each antenna ('signatures'). In total we build four distinct 
signatures per antenna, averaging each hour on i) weekdays in non-summer months, ii) 
weekends in non-summer months, iii) weekdays in summer months and iv) weekends 
in summer months. This results in 24 x 4 features per antenna (figure 2). 
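
The construction of the 24 x 4 features can be sketched as follows. The sketch assumes that the input for one antenna is a list of (timestamp, count) pairs with at least one recorded activity, that "standardizing" means dividing by the antenna's total activity, and that July and August are the summer months; none of these details is specified above, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

SUMMER_MONTHS = {7, 8}  # assumption: the summer months are not listed in the text


def signature(hourly_counts):
    """Return the 96 features (4 day types x 24 hours) of one antenna.

    hourly_counts: list of (datetime, count) pairs with a positive total count.
    """
    total = sum(count for _, count in hourly_counts)
    sums = defaultdict(float)  # relative activity per (day type, hour)
    days = defaultdict(set)    # distinct dates observed per day type
    for ts, count in hourly_counts:
        day_type = (ts.month in SUMMER_MONTHS, ts.weekday() >= 5)
        sums[day_type, ts.hour] += count / total
        days[day_type].add(ts.date())
    features = []
    # Order: weekdays/weekends in non-summer months, then weekdays/weekends in summer.
    for summer in (False, True):
        for weekend in (False, True):
            day_type = (summer, weekend)
            n_days = len(days[day_type]) or 1
            features.extend(sums[day_type, hour] / n_days for hour in range(24))
    return features


# Hypothetical antenna: one Monday in May with 1 event per hour,
# one Saturday in July with 24 events at noon.
records = [(datetime(2007, 5, 14, hour), 1) for hour in range(24)]
records.append((datetime(2007, 7, 14, 12), 24))

features = signature(records)
```

Because the counts are divided by the antenna's total before averaging, two antennas with the same daily rhythm but very different traffic volumes obtain the same signature.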


Fig. 2 Average relative activity profile for all antennas grouped by summer/non-summer months 
and week-/week-end days. 


Because standardizing the activity profiles implies a loss of information about 
the absolute amount of activities, we add the circumference of the Voronoi polygon 
of each antenna as a complementary feature. Lower circumferences indicate 
locally higher antenna densities and, given demand-driven placement, higher 
amounts of activities expected by the operator. 


4 Methodology 


4.1 Classification methods 


For our classification task, we consider each antenna as an observation that needs 
classification in one ZAUER class. The output of the procedure will form a zoning 
that can be compared to the official one. Multiple algorithms are available to carry 
out multiclass classification procedures. We implement the random forest approach 
described by [Breiman, 2001], Boosting Trees [Schapire, 2003] and the Elastic-Net 
penalized Logistic Regression [Zou and Hastie, 2005]. 


The Logistic Regression with Elastic Net penalty consists of maximizing the 
likelihood under a constraint on the magnitude of the coefficients [Zou and Hastie, 2005]. 
Specifically, this approach accounts better for multicollinearity between features 
(which is likely to occur here since our features are extracted from a temporal 
profile) than the original LASSO [Tibshirani, 1996]. In contrast to the Logistic 
Regression, Random Forests and Boosting Trees do not assume linear relationships 
between variables. Random forests [Breiman, 2001] aggregate classification trees 
built on bootstrap samples, but introduce randomness by sampling a subset of 
regressors from the initial set of variables at each split of each tree. Boosting 
Trees [Schapire, 2003] work rather differently: boosting is an additive, adaptive 
procedure which takes into account the largest prediction errors at a given iteration 
when calibrating the next iteration, by updating the observations' weights. 
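
For reference, the constrained maximization mentioned for the Elastic Net is usually written in its equivalent penalized form. The notation below ($\ell$ for the log-likelihood, $\beta$ for the coefficients, $\lambda \ge 0$ for the overall penalty strength and $\alpha \in [0, 1]$ for the mixing parameter) is introduced here for illustration and follows a common parametrization rather than anything stated in this paper:

```latex
\hat{\beta} = \arg\max_{\beta} \left\{ \ell(\beta)
  - \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right] \right\}
```

Setting $\alpha = 1$ recovers the LASSO penalty and $\alpha = 0$ the ridge penalty; intermediate values combine the sparsity of the former with the ability of the latter to share weight among correlated features.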
4.2 Challenges 


Mobilizing mobile phone data for urban areas classification at a national level raises 
several challenges. First, the official ZAUER classification consists of 9 imbalanced 
classes, meaning that both municipalities and antennas are heterogeneously 
distributed among the classes and with respect to the urban fabric. In anticipation of 
this problem, we regroup the existing 9 classes into 6 by merging medium urban 
centers, small urban centers, and their respective surroundings. More importantly, 
however, we apply downsampling, which consists of removing instances from the 
majority class, to minimize the effect of imbalanced classes on our classifiers. 
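
Downsampling can be sketched as below. This version randomly discards observations until every class is as large as the smallest one, which is only one possible convention (the text states only that instances of the majority class are removed); the function name and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample(samples, labels, seed=0):
    """Randomly drop observations so that every class keeps as many
    instances as the smallest class (assumed balancing rule)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(xs) for xs in by_class.values())
    kept = []
    for y, xs in by_class.items():
        kept.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(kept)
    return [x for x, _ in kept], [y for _, y in kept]


# Hypothetical antennas (identified by an integer) with imbalanced classes.
antennas = list(range(18))
classes = [1] * 10 + [2] * 3 + [3] * 5

balanced_antennas, balanced_classes = downsample(antennas, classes)
class_sizes = Counter(balanced_classes)
```

The price of balancing is that part of the training information on the large classes is thrown away, so the rule is applied to training data only in typical practice.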


Secondly, the extended area of investigation implies high degrees of spatial 
autocorrelation and high volatility of antenna activity profiles within the different 
classes. As such we pay special attention to spatial autocorrelation. To de-correlate 
testing and training sets, we first perform stratified sampling while segmenting 
the map of France into four comparable quadrants to produce an initial test set 
(guaranteeing some minimal representativeness among regions). Next, the nearest 
neighbours of each selected observation are added to the test set so that we can evaluate 
the algorithms on their ability to reproduce a zoning and not only a point-wise 
classification of one antenna. Finally, once the test set is built, we consider every remaining 
antenna not located in a buffer region around the test observations as a training 
sample. This last step ensures spatial de-correlation. We use the same approach to build 
the data partitions used for cross validation. 
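
The last two steps of this procedure can be sketched as follows, assuming that the initial test antennas have already been drawn by stratified sampling, that antenna locations are planar coordinates in kilometres, and that the number of neighbours and the buffer radius are free parameters; all names and values are hypothetical.

```python
from math import dist


def spatial_split(coords, seed_ids, n_neighbours=3, buffer_km=10.0):
    """Grow the test set with nearest neighbours, then keep as training
    antennas only those lying outside a buffer around every test antenna."""
    test = set(seed_ids)
    for s in seed_ids:
        others = sorted((i for i in range(len(coords)) if i != s),
                        key=lambda i: dist(coords[i], coords[s]))
        test.update(others[:n_neighbours])
    train = [i for i in range(len(coords))
             if i not in test
             and all(dist(coords[i], coords[t]) > buffer_km for t in test)]
    return sorted(test), train


# Hypothetical antenna coordinates (km); antenna 0 is the sampled test seed.
coords = [(0, 0), (1, 0), (0, 1), (5, 0), (50, 50), (51, 50), (100, 100)]
test_ids, train_ids = spatial_split(coords, seed_ids=[0], n_neighbours=2)
```

In the toy example, antenna 3 is neither tested nor used for training: it lies inside the buffer, which is what prevents near-duplicate neighbours from leaking between the two sets.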


Thirdly, Urban Areas are characterized by various degrees of similarity and spatial 
entanglement (especially at the borders of areas where the validity of urban 
area typologies may be less reliable). We address this issue by resorting to the 
Fuzzy Kappa metric [Hagen, 2003] and the improved Fuzzy Kappa 
[Hagen-Zanker, 2009], allowing us to evaluate (and calibrate) our models while 
taking into account both location and category fuzziness. 


5 Results 


Following the procedures outlined above, we applied three classifiers in order to 
predict urban areas from signatures of mobile phone activity. Kappa and Fuzzy 
Kappa computed on the test sets are reported in table 1. Fuzzy Kappa values range 
from 0.66 to 0.70 whereas Kappa values lie between 0.59 and 0.65. According to 
magnitude guidelines in the literature (for example [Landis and Koch, 1977]), values 
between 0.61 and 0.80 are considered substantial (1 being perfect agreement). 
Detection rates per class are represented in figure 3 for the different classifiers. We 
can see that synthetic measures like Kappa or Fuzzy Kappa mask a heterogeneity 
in the detection rates per class, highlighting the fact that some classes are more 
difficult to detect than others. 
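
For readers unfamiliar with the (non-fuzzy) Kappa statistic, the sketch below computes Cohen's kappa from a confusion matrix of observed versus predicted classes. It does not implement the Fuzzy Kappa, which additionally accounts for location and category similarity; the function name and the toy matrix are hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix
    (rows = observed classes, columns = predicted classes)."""
    k = len(confusion)
    n = sum(map(sum, confusion))
    # Observed agreement: share of observations on the diagonal.
    p_o = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of the row and column marginal shares.
    row = [sum(r) / n for r in confusion]
    col = [sum(confusion[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row, col))
    return (p_o - p_e) / (1 - p_e)


# Hypothetical two-class confusion matrix for 100 antennas.
confusion = [[40, 10],
             [5, 45]]
kappa = cohen_kappa(confusion)
```

Here 85 % of the antennas are correctly classified while 50 % agreement would be expected by chance, so kappa equals (0.85 − 0.50)/(1 − 0.50) = 0.7.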
Table 1 Kappa and Fuzzy Kappa for the different classifiers 


Method Kappa Fuzzy Kappa 


Random Forests (RF) 0.62 0.67 
Boosting Trees (BT) 0.63 0.69 
Elastic Net (ENet) 0.59 0.67 


Fig. 3 Classification rate (in %) per class for each classifier 


We observe small differences in the capabilities of the algorithms to detect 
classes. In general, major urban centers (class 1) present an excellent rate of 
detection (about 95 %) for every approach. Acceptable rates (50 to 70 %) can be achieved 
for classes 4, 5, and 6 (medium and small urban centers and their surroundings, 
multipolarized municipalities, and isolated municipalities). Classes 2 and 3 are more 
difficult to discern, especially class 3 (multipolarized municipalities in a large 
urban area), whose detection rate varies from 30 % to 60 % in the best scenario. Class 
2 (surroundings of major urban centers) is properly detected in only 40 to 50 % 
of the cases. The results of our classification for Normandy, a region in the west of 
France that mixes all urban classes, are displayed in figure 4. 


Fig. 4 Observed (obs) and predicted (based on the different classifiers) Urban areas for Normandy 
6 Discussion 


Our most remarkable finding is the contrast between the accurate prediction
of major urban centers (class 1) and the heterogeneous performance in predicting
the other classes (ranging between 30 and 70 %). Still, the different classifiers
give consistent results (with slight variations for one class or another),
which prompts us to look for explanations in the characteristics of
our validation data (ZAUER) and of the mobile phone data we used.


A first remark concerns municipalities in border areas between different ZAUER
classes. There, the antenna signatures of two distinct urban areas may sometimes
be very similar and thus hard to distinguish. The difference of 0.05 to 0.08
between Fuzzy Kappa and Kappa (Table 1), however, indicates that our algorithms
are able to cope at least partly with this difficulty by sometimes predicting
wrong but close classes (in terms of similarity of the classes or of spatial
location).


A second remark concerns the limitations of mobile phone data as a source
for urban area recognition. CDR data are, by design, dependent on users’ phone
usage, and the extracted activity profiles depend on users’ movement patterns,
both of which might differ between regions and urban areas. In addition, the local
market shares of individual operators are unknown, making it impossible to control
correctly for representativeness. Other uncertainties stem from using the spatial
resolution of the cellular tower network. This resolution does not, of course,
coincide with administrative borders, which creates a discrepancy between
information gathering and the proposed classification task. Finally, antenna
positioning may prevent the antenna signature from reflecting the presence of
(local) populations. Some antennas might capture only specific behavior of local
subscribers when, for instance, they are positioned along transport axes.


The choice of methods seems less of an issue. One alternative would have been
to resort to unsupervised techniques, as is often done in the land use literature
[Aguilra et al., 2014, Soto and Martinez, 2011]. Yet differences in antenna
signatures can be interpreted in multiple ways. Exploring supervised classification
therefore appears to be a useful preliminary step for extracting relevant features.
In this context, classification algorithms with built-in feature selection, like
penalized logistic regression and random forests, are extremely useful as they make
it possible to identify the features contributing most to the discrimination of the
classes of interest (magnitude of the coefficients in penalized logistic regression,
variable importance measures in random forests), thus yielding interesting insights.

7 Conclusion


In conclusion, we would like to consider applications and improvements. In terms
of applications, our results encourage us to promote the use of mobile phone data
as an alternative source for producing recurrent urban area zoning between official
but less frequent releases. Specifically, we are quite optimistic about using
supervised classification to, for example, reveal patterns in the emergence of urban
centers or the progression of urban areas. We recognize, however, that assessments
regarding the urbanization of rural and isolated areas should remain cautious, as
our classification underperformed there. In terms of improvements, we hope that more
recent sources of CDR data, or other sources like DDR (data detail records), could
provide denser and more frequent observations in remote areas, improving our
automated classification. Ultimately, we hope that the results of our classification
case can encourage a more widespread use of machine learning techniques in official
statistics. In our work we have shown that the application of such techniques is
rather straightforward and can be instructive for both future and past work.

Statistical methods in assessing the equality of
income distribution, case study of Poland


Metodi statistici per valutare l'uguaglianza nella 
distribuzione del reddito, caso di studio della Polonia 


Viktoriya Voytsekhovska, Olivier Butzbach


Abstract We analyse the development of gross wages of employees in Poland over
different time periods against social justice criteria. Gross wages are grouped
into intervals, and the wages in each interval are related to gross remuneration in
the adjacent time period. The resulting dependencies are used to determine the
amount by which gross wages increase and the growth rate of remuneration for
individual categories of employees. To assess the adequacy of the approach, the
theoretical results are compared with statistical data. We draw conclusions on the
existence of stable patterns of equal distribution, achieved through appropriate
wage growth dynamics in certain categories of workers.

Key words: social justice, gross wages, income distribution, equality


Abstract In recent years Poland has pursued the objective of raising the gross
wages of employees on the basis of principles of social equity.

In this work, with reference to the years 2010 to 2014, we analyse the
relationships between income concentration indices and the level of gross wages in
the previous year.

The conclusions discuss the possible use of sustainable equal-distribution models
to achieve appropriate wage growth dynamics in given categories of workers by
income level.


Parole chiave: giustizia sociale, salari lordi, distribuzione del reddito, uguaglianza 


Introduction 

Over the past decades, questions of social and income inequality, alongside the
recession that followed the global financial crisis and the growth of GDP, have
attracted growing interest from international organisations such as the United
Nations, the European Union and the OECD, and from politicians and scientists
worldwide [2, 13, 14, 16].
A recent OECD study confirms the hypothesis of a number of scientists that
inequality increases despite GDP growth: the richest 10% of the population now
receive 9.5 times the income of the poorest 10%, whereas in the 1980s this ratio was
7.1 [5, 10, 11]. However, Atkinson and Tinbergen, who studied inequality for 40
years, highlight the importance of investment in human capital and education, which
would respond to the increasing demand for highly skilled employees in a
technologically growing world [1, 13]. The Italian economists M. Raitano and
E. Granaglia also see the roots of inequality in globalization and routinization,
examining the good and the bad effects of certain policies on inequalities [3, 10].
The magnitude of inequality differs from country to country and depends on a variety
of factors, from tax and educational policies to decision-making.

The structural perspective put forward by Van de Sande and Byvelds holds that
social work research, including statistics, should be taught from a structural
perspective and must follow anti-oppressive principles, which view the problems
experienced by people as rooted in the social, political and economic structures of
society [15].


1.1 Main study 


At the macro level, the Gini coefficient has increased in a number of countries
even though, in recent years, economic growth has created new jobs and lowered
unemployment, which in turn should decrease income inequality. Fig. 1 shows the
Gini coefficient of household income for OECD countries; the coefficient varies
from 0 to 1, where 0 means perfect equality. Since 2007 it has decreased in
countries such as Turkey, Iceland and Latvia, while it has increased in the Slovak
Republic, Spain and Sweden [5, 9].
Figure 1: Gini coefficient of disposable income inequalities 2007-2014, OECD countries [9].

However, economic growth does not affect all countries equally. In Poland the
Gini coefficient has decreased slightly since 2007 (from 0.32 to 0.30), while
overall inequality has increased since the 1980s owing to the transition from
socialism to a market economy [11, 13].
The figure above shows the historical Gini coefficients of the OECD countries.
Inequalities can be of economic or social origin, and it is important to understand
the dynamics for a specific country in order to make adequate decisions and
policies.

The goal of this work is to analyse the dynamics of employees’ wage growth
in Poland over the period 2010-2014. A feature of this study is that the Polish
statistical groupings report wages in 15 intervals, which we merged into 5 groups
in increasing order of wage level [6, 7, 8]. Within each group, the average level of
wages was determined by smoothing. Thus, for each group, we obtained the 5-year
average salary in accordance with the law of their distribution. The initial data
can be described by a dome-shaped distribution law (Figure 2).
Figure 2. Wage distribution in Poland, 2012 [7].


Fig. 2 illustrates the wage distribution, which is asymmetrical: the 50% of
employees with the lowest wages obtain only 29% of total wages. The corresponding
cumulative values determine the Lorenz curve and the Gini coefficient, shown
graphically in Fig. 3 for the 2012 data. In the empirical data we observe a peak in
the 8th salary interval, which is the same for each studied year; in our opinion
this means that the share of employees with such a salary is the largest.
Figure 3. Lorenz curve for the Polish data on gross wages (2012) [7].


To simplify the determination of the Gini coefficient, the Lorenz curve is
approximated by the following polynomial function: $y = 0.008x^2 + 0.017x + 4.397$.
This simplifies the determination of the area under the Lorenz curve by
integration, $\int_0^{100} y(x)\,dx$, which equals 3191.367. The Gini coefficient is
then G = 36.2%.
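
The integral and the resulting coefficient can be checked numerically. The sketch
below assumes that both axes of the Lorenz curve are expressed in percent, so that
the area under the line of perfect equality is 100 * 100 / 2 = 5000, and that G is
one minus the ratio of the two areas; the function names are illustrative.

```python
# Coefficients of the fitted parabola y = a*x^2 + b*x + c, as given in the text.
A, B, C = 0.008, 0.017, 4.397


def area_under_parabola(upper, a=A, b=B, c=C):
    """Exact integral of a*x^2 + b*x + c from 0 to `upper`."""
    return a * upper**3 / 3 + b * upper**2 / 2 + c * upper


def gini_from_area(area, upper=100.0):
    """Gini coefficient, assuming the equality line y = x on [0, upper]."""
    equality_area = upper * upper / 2
    return 1 - area / equality_area


area = area_under_parabola(100.0)
gini = gini_from_area(area)
print(f"area = {area:.3f}, G = {100 * gini:.1f}%")
```

Under these assumptions the sketch reproduces both the area and the value of G
quoted above.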

It should be noted that a parabolic function is only one possible approximation;
the Lorenz curve can also be described by other types of function, such as splines.

It should also be noted that this figure is obtained for gross salaries only,
whereas the Gini coefficient is generally calculated for the entire population,
taking other factors into account. The distribution and the coefficient G are
approximately identical throughout the period. The value determined for the labour
market may differ from the one defined for the entire population because of
additional redistribution.

The stability of the Gini coefficient over the studied period leads to the
conclusion that general business practice maintained approximately the same
level of fairness in the distribution of wages. It is nevertheless important to
analyse the dynamics of salaries for certain categories of workers by wage level.

The dynamics of wage growth, as we found, differ between selected groups of
employees. These groups have the feature that the number of employees in each is
almost constant across the studied periods. For the 5 selected groups, these
dynamics are shown in Fig. 4.
Figure 4. The dynamics of wages in 5 major groups of employees [6, 7, 8].


According to the actual statistics, the wages of employees in the group with the
lowest wages increased in 4 years from 1.564 to 2.148 thousand PLN, i.e. by 37%.
The salaries of employees in the group with the highest wages increased over the
same period from 9.283 to 10.806 thousand PLN, i.e. by 16%. Across all employees,
the average salary rose from 3.373 thousand PLN in 2010 to 3.981 thousand PLN in
2014, i.e. by 18%. Thus, wage growth dynamics differ between groups of employees,
and the biggest difference in growth is between the groups with the lowest and
highest earnings.

The research made it possible to identify some patterns of non-standard
quantitative growth of employees' wages within their groups. The following linear
dependences were obtained by regression, with an almost functional fit
($R^2 = 0.995$).

Between the salaries of employees in 2010 and 2012 the following relationship
holds: $x_{2012} = 1.092\,x_{2010} + 0.069$.

The important fact here is that this relationship is one and the same for all 5
groups of employees. So we have a pattern according to which the distribution of
wages in 2010 is transformed into the distribution in 2012 by one and the same
linear dependence. A similar transformation holds for salaries in 2014, with the
corresponding dependence $x_{2014} = 1.039\,x_{2012} + 0.174$.

Here too the relationship is valid for all 5 groups of employees. The linear
relationship in turn determines the corresponding rate of wage growth:

$\frac{x_{2012}}{x_{2010}} = 1.092 + \frac{0.069}{x_{2010}}; \qquad \frac{x_{2014}}{x_{2012}} = 1.039 + \frac{0.174}{x_{2012}}$

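
A short Python sketch illustrates how these growth rates fall as the base wage
rises. It uses only the fitted coefficients and the 2010 wages of the lowest and
highest groups quoted in the text; the function and variable names are
illustrative, and the fitted rates need not coincide with the observed growth of
any individual group.

```python
def growth_rate(base_wage, slope, intercept):
    """Growth rate x_next / x_base implied by x_next = slope * x_base + intercept."""
    return slope + intercept / base_wage


# Coefficients of the two fitted relationships given in the text.
FIT_2010_2012 = (1.092, 0.069)
FIT_2012_2014 = (1.039, 0.174)

# 2010 wages (thousand PLN) of the lowest and highest groups, as quoted in the text.
low_2010, high_2010 = 1.564, 9.283

rate_low = growth_rate(low_2010, *FIT_2010_2012)
rate_high = growth_rate(high_2010, *FIT_2010_2012)
print(f"2010-2012 fitted growth: lowest group {rate_low:.3f}, highest group {rate_high:.3f}")
```

The constant part of the rate is shared by all groups, while the second term
shrinks as the base wage grows.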
Conclusions 


According to the dependencies obtained, the rate of wage growth decreases with
the base wage: higher wages increase to a lesser extent. The growth rate of wages
therefore contains two components, a constant and a variable term that is
inversely proportional to the base salary.
So we can assume that some justice is achieved in such a distribution of the total
payroll. It should be noted that the constants of the linear dependence undergo
some changes over time under the influence of economic growth and changes in the
average wage of workers. In subsequent studies, in our view, it is important to
consider the factors behind, and the rationale for, these wage transformation
formulas, and to examine whether they also exist in the economies of other
countries. These results are important for economic decision-making and social
policy and require further analysis.

Network inference in Genomics


Ernst C. Wit 


Abstract The concept of network inference in genomics has multiple meanings
and interpretations. It can refer to “causal” or “topological” considerations, i.e.,
learning about functional relationships in the genomic system or about the
structure of the overall genomic network. Moreover, the genomic network
does not really exist: the term can refer to gene regulatory networks, cell
signalling networks, metabolic networks, etc.

In this manuscript I aim to clarify the underlying genomics in order to motivate a 
hierarchy of four network inference strategies, starting at the single cell level and 
finishing at global structural network inference. It will involve stochastic and ordi- 
nary differential equation models, causal inference and graphical modelling. 


Key words: networks, stochastic differential equations, ordinary differential
equations, causal inference, graphical models


1 Introduction 


Networks have become an important paradigm to describe genomic systems: from
describing the physical, molecular interactions between proteins to the abstract
interactions between functional genetic units, the jargon of networks has been
adopted eagerly by biologists tasked with the study of this complex system. In this
paper, we outline four modelling strategies that are useful in various aspects of
this enterprise. We start with a system of stochastic differential equations to
describe single cell interactions. Often, however, data are observed either at a
more aggregated level or across a number of cells that are destructively sampled.
In those cases, temporal models are more appropriately described by means of
ordinary differential equations. Both these models are inherently dynamic.
Nevertheless, it is sometimes more appropriate to describe the genomic interactions
by means of a single directed network, whereby the arrows have an inherently causal
interpretation. We will introduce Pearl’s causal framework and show how we can
extend this beyond the usual Gaussian assumptions. Finally, we will drop the causal
assumptions to merely describe the relationships between genomic components. This
can give hints about the existence of certain interactions that may be responsible
for particular phenotypes.


2 Stochastic differential equations for single cell interactions 


The transmission of signals (information) in the cell’s environment is regulated
by various signal transduction pathways. This signalling process is typically
started by an external stimulus of the pathway leading to the binding of the signal,
e.g. a hormone or growth factor, to a receptor, and ends with the binding of a
target protein. All cellular decisions such as cell proliferation, which refers to
the frequent and repeated reproduction of the cell, differentiation, which is the
development of cells with specialized structure, or apoptosis, which implies cell
death as a result of an intracellular suicide programme, are directed by different
levels of transduction (Hornberg, 2005).
A general biochemical reaction can be defined as

$k_1 Q_1 + k_2 Q_2 + \dots + k_l Q_l \longrightarrow s_1 P_1 + s_2 P_2 + \dots + s_p P_p, \qquad (1)$

where the terms on the left side, denoted by $Q$, are called the reactants and the
ones on the right side, denoted by $P$, are called the products. The coefficients
$k_i$ ($i = 1, \dots, l$) and $s_j$ ($j = 1, \dots, p$) are the stoichiometric
coefficients associated with the $i$th reactant $Q_i$ and the $j$th product $P_j$,
respectively; $l$ is the number of required reactants and $p$ is the number of
resulting products. The chemical interpretation of this equation is that $k_i$
molecules of type $Q_i$ collide with each other and produce $s_j$ molecules of type
$P_j$, while the molecules move around randomly by Brownian motion (Wilkinson,
2006). Therefore, under thermal equilibrium and fixed volume, a biochemical reaction
shows which species react together, in what proportions, and what they produce
(Bower & Bolouri, 2001).

For a set of $r$ reactions and $n$ species, accordingly, we can express the
molecular transfer from reactant to product species as the net change $V = S - K$,
where $V$ is the $n \times r$ net effect matrix, $S$ denotes the $n \times r$ matrix
of stoichiometries of products and $K$ is the $n \times r$ matrix of
stoichiometries of reactants.
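
As a small illustration, the net effect matrix can be computed entry by entry. The
reaction system below (a reversible binding A + B <-> C, with species ordered A, B,
C) is a hypothetical example chosen for this sketch, not a system taken from this
paper, and the function name is illustrative.

```python
def net_effect_matrix(products, reactants):
    """Return V = S - K for two n x r stoichiometry matrices given as nested lists."""
    if len(products) != len(reactants):
        raise ValueError("S and K must have the same number of rows (species)")
    net = []
    for s_row, k_row in zip(products, reactants):
        if len(s_row) != len(k_row):
            raise ValueError("S and K must have the same number of columns (reactions)")
        net.append([s - k for s, k in zip(s_row, k_row)])
    return net


# Hypothetical system with n = 3 species (A, B, C) and r = 2 reactions:
#   reaction 1: A + B -> C        reaction 2: C -> A + B
K = [[1, 0],   # A consumed by reaction 1
     [1, 0],   # B consumed by reaction 1
     [0, 1]]   # C consumed by reaction 2
S = [[0, 1],   # A produced by reaction 2
     [0, 1],   # B produced by reaction 2
     [1, 0]]   # C produced by reaction 1

V = net_effect_matrix(S, K)
```

Each column of V gives the change in the molecule counts caused by one occurrence
of the corresponding reaction.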

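The master equation is defined as a differential equation for the process transition
probability and can be written as:

$\frac{\partial P(X,t)}{\partial t} = \sum_{k=1}^{r} \left[ h_k(X - V_k;\theta)\, P(X - V_k,t) - h_k(X;\theta)\, P(X,t) \right], \qquad (2)$

where $h_k$ is the hazard function of reaction $k$ and $V_k$ is the $k$th column of $V$.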
By means of a multivariate Taylor expansion, it is possible to derive an equivalent
alternative formulation of any master equation, named the Kramers-Moyal expansion
(Kampen, 1981):

$\frac{\partial P(X,t)}{\partial t} = \sum_{m=1}^{\infty} \frac{(-1)^m}{m!} \sum_{j_1,\dots,j_m=1}^{N} \frac{\partial^m}{\partial X_{j_1} \cdots \partial X_{j_m}} \left[ a^m(X;\theta)\, P(X,t) \right] \qquad (3)$

where $a^m(X)$ are $m$-order symmetric tensors commonly called jump moments (Moyal,
1949) or propagator moment functions (Gillespie, 1992).

We will derive a method-of-moments type estimator to infer the parameters of
interest. This involves matching each observation $X_t$ with its expected value
given the previous observation,

$X_t = m(t;\theta) + \varepsilon_t \qquad (4)$

where $m(t;\theta)$ is a known non-linear function of the process state at time step
$t-1$, and $\varepsilon_t$ is an $N$-dimensional mismatch variable with
$E[\varepsilon_t] = 0_N$ and $\mathrm{Var}(\varepsilon_t) = W_t$, where
$W_t = g(X_{t-1};\theta)$ is an $N \times N$ matrix, for some conveniently defined
expectation and variance operators.
The conditional expectation of the process at time $t$ given the previous time
point $t-1$,

$m(t;\theta) = E[X_t \mid X_{t-1}, \theta],$

is non-linear. It is an $N$-dimensional vector corresponding to the $N$ predicted
values for $X_t$ at time $t$ given the previous observed process state, $X_{t-1}$.
We can then derive a system of differential equations for the evolution of the first
moments of the process. We define the function $m$ as the solution of the
$N$-dimensional system of ordinary differential equations,

$\frac{d m_i(t)}{dt} = E[a^1_i(X_t;\theta) \mid X_{t-1}], \quad i = 1, \dots, N, \qquad (5)$

with initial conditions $m_i(t-1) = X_{i,t-1}$, $i = 1, \dots, N$, where

$a^1_i(x;\theta) = \sum_{k=1}^{r} V_{ik}\, h_k(x;\theta), \quad i = 1, \dots, N. \qquad (6)$

If the hazard functions $h_k$ are linear in $x$, then (5) simplifies to
$\frac{d m_i(t)}{dt} = a^1_i(m;\theta)$, $i = 1, \dots, N$.
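
For linear hazards, the simplified mean equation can be integrated numerically. The
following Python sketch uses a plain forward Euler scheme on a hypothetical
two-species system (A -> B with hazard c1 * x_A, and B -> A with hazard c2 * x_B);
the system, the rate constants and the function names are assumptions made for
illustration only, and Euler integration is just one simple choice of solver.

```python
def drift(m, V, hazards):
    """First jump moment a^1(m) = V h(m), with V an N x r nested list."""
    h = hazards(m)
    return [sum(V[i][k] * h[k] for k in range(len(h))) for i in range(len(m))]


def euler_mean(m0, V, hazards, t_end, dt):
    """Forward Euler solution of dm/dt = V h(m) starting from m0."""
    m = list(m0)
    for _ in range(int(round(t_end / dt))):
        a = drift(m, V, hazards)
        m = [m[i] + dt * a[i] for i in range(len(m))]
    return m


# Hypothetical reversible conversion between two species A and B.
C1, C2 = 1.0, 0.5                      # illustrative rate constants
V = [[-1, 1],                          # effect of each reaction on A
     [1, -1]]                          # effect of each reaction on B


def linear_hazards(m):
    return [C1 * m[0], C2 * m[1]]


m_final = euler_mean([100.0, 0.0], V, linear_hazards, t_end=20.0, dt=0.001)
```

In this toy system the total number of molecules is conserved, and the mean of A
settles near 100 * C2 / (C1 + C2).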

3 Ordinary differential equations for genomic interactions


Consider a gene regulatory or signalling network described by a system of
ordinary differential equations of the form

$x'(t) = f(x(t), \theta), \quad t \in [0,T], \qquad x(0) = \xi, \qquad (7)$

where $x(t)$ takes values in $\mathbb{R}^d$, $\xi$ in $\Xi \subset \mathbb{R}^d$,
$\theta$ in $\Theta \subset \mathbb{R}^p$ and $f = (f_1, \dots, f_d)^\top$ is a known
function. Given the values of $\xi$ and $\theta$, we denote the solution of (7)
by $x(t) = x(t,\theta,\xi)$. For simplicity, assume that we have noisy observations
$Y_i(t_j)$, $j = 1, \dots, n$, of the first $1 \le d_1 \le d$ states
$x_i(t,\theta,\xi)$, $i = 1, \dots, d_1$, at time points $t_j \in [0,T]$,
$j = 1, \dots, n$:

$Y_i(t_j) = x_i(t_j,\theta,\xi) + \varepsilon_i(t_j), \quad i = 1, \dots, d_1; \; j = 1, \dots, n, \qquad (8)$

where $0 \le t_1 < \dots < t_n = T < \infty$ and $\varepsilon_i(t_j)$ is the
unobserved measurement error for $x_i$ at time $t_j$. The problem is to estimate
$\theta$ from the data $Y$, where $Y = \{Y_i(t_j)\}_{ij}$ denotes the matrix that
contains all the observations.

We define an estimator $\hat\theta_n$ of $\theta$ as follows:

$\hat\theta_n = \arg\min_{\theta \in \Theta} M_n(\theta, \hat{x}(\theta) \mid Y), \qquad (9)$

where for every $\theta \in \Theta$ the approximation $\hat{x}(\cdot,\theta)$ of the
ODE solution $x(\cdot,\theta)$ is defined by

$\hat{x}(\theta) = \arg\min_{x \in \mathcal{X}_n} J_{\alpha,\gamma}(x \mid \theta, Y). \qquad (10)$

Here $J_{\alpha,\gamma}$ is a functional with tuning parameters $\alpha > 0$ and
$\gamma > 0$, which is optimized over a certain finite-dimensional function space
$\mathcal{X}_n$. $M_n$ is a criterion function to be minimized, for example the
negative log-likelihood. Minimization of $M_n$ involves starting from some initial
guess $\theta_0$ and iterating over the parameter $\theta$, where at every iteration
the minimization problem (10) is solved.


4 Causality in genomic networks 


Pearl (2009) defined causality through intervention, whereby variables are
externally manipulated to take certain values. This intervention changes the
underlying distribution $P$ and can be expressed by adapting the direct effect
diagram. The new distribution is called the intervention distribution and we say
that the variables whose structural equations we have replaced have been
“intervened on.” The intervention distribution of $Y$ when intervening and setting
the variable $X_i$ to a value $x_i$ is denoted by $P(Y \mid do(X_i = x_i))$. The
intervention on variable $X_i$ is characterized by a truncated factorization, in
which an intervention DAG $G'$, arising from the non-intervention DAG $G$, is
defined by deleting all edges that point into the node $X_i$. Whereas most theory
is derived using an underlying multivariate normal distribution, Mahmoudi & Wit
(2017) derived a way to extend the causal effect calculus to the class of
nonparanormal distributions.
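
The graph surgery behind the intervention DAG can be sketched in a few lines of
Python. The DAG is stored as a mapping from each node to its set of parents; the
three-gene network and the function name are hypothetical and serve only to
illustrate the deletion of incoming edges.

```python
def intervention_dag(parents, target):
    """Parent sets of the intervention DAG G' obtained from G by do(target).

    All edges pointing into `target` are deleted; every other edge is kept.
    The input mapping is left unchanged.
    """
    if target not in parents:
        raise KeyError(f"unknown node: {target}")
    return {
        node: set() if node == target else set(node_parents)
        for node, node_parents in parents.items()
    }


# Hypothetical DAG G: g1 -> g2, g1 -> g3, g2 -> g3.
G = {"g1": set(), "g2": {"g1"}, "g3": {"g1", "g2"}}

# Intervening on g2 removes the edge g1 -> g2 but keeps g1 -> g3 and g2 -> g3.
G_prime = intervention_dag(G, "g2")
```

The truncated factorization then uses the conditional distributions of all nodes
except the intervened one, with that node held at its assigned value.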

5 Graphical models for genomic networks


Graphical models are a general class of probabilistic models that interpret the net- 
work as a conditional independence graph. This simple assumption is both quite 
sensible from an applied biological point of view and powerful from a computa- 
tional point of view. The absence of a link in a genetic network typically means that 
the associated two nodes do not interact or regulate each other directly. This means 
that if any intermediate regulator or metabolic state were to be kept fixed, then the
pair of nodes would either not vary at all or their variations would be unrelated to
each other. Computationally, this conditional independence structure translates
directly, by means of the Hammersley-Clifford theorem, into a simple factorization
of the probabilistic model into more easily manageable components.

Most texts on graphical models start with the general theory and slowly build to- 
wards more practical models, such as Gaussian graphical models. We have decided 
to turn this around. We begin with the very specific Gaussian graphical model, which 
has the advantage of being estimable even in large networks. Then we slowly peel 
away the restrictions of such models to be able to deal also with more complicated 
structures. 

We will describe a network inference method for sparse high-dimensional
biological networks. In the Gaussian graphical model setting, this means
specifically inferring a multivariate normal distribution with many zeros in its
associated precision matrix. As well as making the problem more tractable by
reducing the number of parameters to estimate, sparsity of the network is also
something expected from the underlying biology. A sparse estimate of the precision
matrix $\Theta$ can be obtained by imposing the $L_1$-penalty constraint directly on
the entries of the precision matrix, rather than on the regression coefficients
associated with each node. The graphical lasso, as it is called, is defined as the
following solution,

$\hat\Theta_C = \arg\max_{\|\Theta\|_1 \le C} \left[ \log|\Theta| - \mathrm{Trace}(S\Theta) \right],$

where $\|\Theta\|_1 = \sum_{i,j} |\theta_{ij}|$ and $C$ is a non-negative tuning
parameter.
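
To make the criterion concrete, the Python sketch below evaluates the objective and
the $L_1$ constraint for candidate precision matrices in the 2 x 2 case. It assumes
that $S$ is the sample covariance matrix, uses a hypothetical $S$, and does not
perform the optimization itself; all function names are illustrative.

```python
import math


def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def trace_of_product(a, b):
    """Trace(a b) for two square matrices of equal size."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))


def l1_norm(m):
    """Sum of absolute values of all entries."""
    return sum(abs(v) for row in m for v in row)


def objective(theta, s):
    """log|Theta| - Trace(S Theta) for a 2 x 2 positive definite Theta."""
    return math.log(det2(theta)) - trace_of_product(s, theta)


def is_feasible(theta, c):
    """Positive definite (2 x 2 case) and within the L1 budget C."""
    return theta[0][0] > 0 and det2(theta) > 0 and l1_norm(theta) <= c


# Hypothetical sample covariance matrix.
S = [[1.0, 0.5], [0.5, 1.0]]
dense = [[4 / 3, -2 / 3], [-2 / 3, 4 / 3]]   # inverse of S, no zeros
sparse = [[1.0, 0.0], [0.0, 1.0]]            # candidate with zero off-diagonals

for name, theta in (("dense", dense), ("sparse", sparse)):
    print(name, round(objective(theta, S), 4), is_feasible(theta, c=2.0))
```

With the budget C = 2 the dense inverse of S has the higher objective value but
violates the constraint, which is how the constraint pushes the solution towards
sparser matrices.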


6 Discussion 


In this manuscript, we have introduced the idea that, depending on the underlying
data structure and the question of interest, it is crucial to adjust the modelling
framework. Although networks have become an important modelling paradigm in
genomics, there currently is no single network model to describe the genomic
interaction structures, and in many ways it is unlikely that there will ever be
one. In fact, because the underlying generative model in biology is extremely
complicated, we will always rely on convenient parametrizations to answer specific
questions that arise in systems biology.

Using Twitter data for Population Estimates


Usare dati Twitter per Stime di Popolazione 


Dilek Yildiz, Jo Munson, Agnese Vitali, Ramine Tinati and Jennifer Holland 


Abstract Twitter is increasingly being used as a source of data for the Social 
Sciences. However, deriving the demographic characteristics of users and dealing 
with the non-random non-representative populations from which they are drawn 
represent challenges for social scientists. This paper has two objectives: first, it 
compares different methods for estimating demographic information from Twitter 
data based on the crowd-sourcing platform CrowdFlower and the image-recognition 
software Face++. Second, it proposes a method for calibrating the non-representative
sample of Twitter users with auxiliary information from official statistics, hence
allowing findings based on Twitter to be generalized to the general population.

Abstract Twitter is increasingly used as a data source for social research.
Deriving the demographic characteristics of Twitter users and the non-random,
non-representative nature of the sample, however, pose a challenge. This
work has two objectives: the first is to compare two different methods for
deriving the demographic characteristics of Twitter users, one based on the
crowd-sourcing platform CrowdFlower, the other on the image-recognition software
Face++. The second is to propose a method for calibrating the non-representative
Twitter sample with population information obtained from official statistics
sources, so as to make inference on the population of interest starting from the
non-representative Twitter sample.


Key words: Calibration, Population, Social Media, Twitter 

1 Introduction


Twitter data, just like data from other social media, are increasingly being used for
Social Science research. However, such data are not representative of the total
population: inference about the total population made directly from social media is
hence invalid. Also, the basic demographic characteristics of Twitter users are not
readily available and hence need to be estimated. This paper proposes a method based
on calibration for reducing the bias between the Twitter population and the total
population. With the aim of establishing best practices, it also compares the results
obtained using two different methods for estimating the demographic characteristics
of Twitter users, both of which were used in previous research (e.g. McCormick et
al. 2015; Zagheni et al. 2014).


2 Data 


We collected Twitter data between 23 June and 4 July 2014 using DataSift’s Twitter 
Firehose connection. This one-week period straddles the mid-year population 
estimates (MYE) for the usual resident population of England and Wales on 30 June 
2014 which are produced annually by the Office for National Statistics (2015). Our 
Twitter sample consists of users who tweeted at least once during the reference 
week. In addition, we restrict our sample to those Twitter users who have at least one 
geo-located tweet in South-East England during the week of observation. The final 
sample comprises 22,356 unique users. 


3 Estimating Age and Sex of Twitter Users 


We estimate age and gender of the Twitter users using two distinct methodologies: 
crowdsourcing, via the CrowdFlower Crowdsourcing platform, and the image- 
recognition software Face++. By restricting our sample to all geo-located tweets, we 
further have information on the location of the users. 

CrowdFlower provides access to a large pool of crowd-workers who will
execute a specific task in exchange for a monetary reward. We designed a task which
presented crowd-workers with a user’s profile description and picture (if available)
and a random tweet, and asked them two questions: “Would you say this Twitter user
is female; male; don't know; the Tweeter is a company/organization/not a person”
and “Take the best guess at the user’s age in years: 0-19; 20-29; 30-39; 40-49; 50+”.
Given the cost of such an experiment, we restrict the sample to be analysed by the
crowd-workers to Twitter users in South-East England.

Face++ is an automated face-detection algorithm developed by Megvii Inc.
(2013). Face++ takes links to image files as its input and outputs an age and
gender estimate. Face++ requires one or more distinguishable faces in the image
provided in order to return a valid result; hence images showing non-human entities,
or images in which the algorithm is unable to identify a face, return a null result.

Figure 1 reports the population pyramids of the 2014 Twitter population
with demographic information estimated via CrowdFlower and Face++. For both
CrowdFlower and Face++, males outnumber females in all age groups, with the
exception of ages 0-19 in Face++. According to the gender estimates based on
CrowdFlower and Face++, there are on average 149 and 138.6 males per 100
females in the Twitter sample, respectively, whereas there are 96.8 males per 100
females according to the 2014 MYE.

According to CrowdFlower, the age group 20-29 represents the modal age 
for both males and females, followed by the age group 30-39. The age groups 0-19 
and 50+ are, as expected, the least represented age groups in the Twitter sample. For 
Face++, the most frequent group in the Twitter sample is males aged 30-39,
followed by both males and females aged 20-29. The youngest age group represents 
a higher proportion of the total Twitter population compared to the CrowdFlower 
estimates, especially among females. 

In order to compare CrowdFlower and Face++, we compute a measure of 
performance for algorithms which attempt to assign data points to one of two or 
more categories, i.e. the Total Accuracy, as follows: 


Total Accuracy = (TN + TP) / (FN + FP + TN + TP) (1) 


where T and F stand for True and False, and N and P stand for Negative and 
Positive, respectively. In order to compute the Total Accuracy, we refer to a gold 
standard set of 123 randomly selected users with a valid profile picture, for whom 
we know the true age and sex, as these were manually verified using LinkedIn 
profiles, Electoral Roll listings and personal websites. As Table 1 shows, the 
accuracy is higher with CrowdFlower. The gender matching when there is a valid 
profile picture is nearly 92% accurate for Face++ and 97% for CrowdFlower, but the 
age matching is only 35% accurate for Face++ vs. 79% for CrowdFlower. 

Figure 1: Population pyramids based on Twitter data, demographic variables estimated with 
CrowdFlower and Face++ (panels: CrowdFlower, Face++) 


Table 1: Accuracy of Face++ and CrowdFlower 
Total Accuracy, valid images (N = 123) 

              Age      Gender 
Face++        35.8%    91.9% 
CrowdFlower   73.2%    97.6% 
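
As an illustration of Equation (1), the following minimal sketch computes the Total Accuracy as the share of users whose estimated category matches the verified one. The function name and the toy labels are hypothetical and are not taken from the gold standard set described above.

```python
def total_accuracy(true_labels, predicted_labels):
    """Share of cases whose predicted category equals the verified one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one labelled case is required")
    # Correct cases are the true positives and true negatives of Equation (1);
    # the denominator is the total number of cases.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical toy data, for illustration only.
true_gender = ["F", "M", "M", "F", "M"]
guessed_gender = ["F", "M", "F", "F", "M"]
print(total_accuracy(true_gender, guessed_gender))  # 4 of 5 guesses are correct
```

The same function applies to the five age categories, since it only checks whether the estimated category equals the verified one.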


4 Calibration Methodology 


We propose a calibration approach for correcting the selection bias in a 
non-representative internet population. This approach relies on a regression framework 
for calibrating the non-representative sample of Twitter users with the auxiliary 
marginal information from the ‘ground truth’ data source, using log-linear models 
with offsets. We extend a calibration methodology developed by Yildiz and Smith 
(2015) to the framework proposed by Zagheni and Weber (2015). 

If an auxiliary data source exists which can be assumed to measure the 
‘true’ population, it can be combined with the dataset containing the counts from the 
Twitter population. This approach proposes to compare the ‘true’ counts of specific 
population subgroups by age and sex in each geographical location obtained from 
the ‘ground truth’ data source, with those obtained from the non-representative 
sample. In this example, we compare the Twitter population to the usual resident 
population of South-East England using the 2014 mid-year population estimates. 
This source is assumed to represent the ‘true’ population count in each of the 67 
local authorities in South-East England, by gender and age group. 

We fit a set of log-linear models with offsets, which take into account the 
fact that the Twitter sample differs from the ‘ground truth’ data in terms of 
association structures between age and/or gender and/or location. Such models are 
estimated by an iterative process and are similar to multiplicative weighting, raking 
or raking ratio estimation. We employ the iterative proportional fitting (IPF) 
algorithm to fit the log-linear models with offsets and produce maximum likelihood 
estimates. We evaluate the ability of each model to calibrate the Twitter users’ data. 
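
The fitting step can be sketched in a few lines. The example below is only an illustration under stated assumptions: the table is stored as a dictionary keyed by (age, sex, location), the function names `margin` and `ipf` are hypothetical, and the counts are invented. Starting from the sample counts (the offset), the cells are rescaled in turn until each chosen two-way margin equals its ‘ground truth’ counterpart; here the age-sex and sex-location margins are used.

```python
from collections import defaultdict


def margin(table, axes):
    """Sum a table {(a, s, l): count} over all axes not listed in `axes`."""
    totals = defaultdict(float)
    for cell, value in table.items():
        totals[tuple(cell[i] for i in axes)] += value
    return totals


def ipf(seed, targets, n_iter=1000, tol=1e-10):
    """Rescale `seed` until each listed margin matches its target.

    seed    : {(a, s, l): count}, the offset (e.g. sample counts)
    targets : list of (axes, {margin_key: target_total}) pairs
    """
    fitted = dict(seed)
    for _ in range(n_iter):
        max_change = 0.0
        for axes, target in targets:
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[i] for i in axes)
                if current[key] > 0:
                    factor = target[key] / current[key]
                    max_change = max(max_change, abs(factor - 1.0))
                    fitted[cell] *= factor
        if max_change < tol:  # all margins already match: stop early
            break
    return fitted


# Hypothetical 2 x 2 x 2 tables (age, sex, location), for illustration only.
truth = {(0, 0, 0): 30, (0, 0, 1): 20, (0, 1, 0): 25, (0, 1, 1): 25,
         (1, 0, 0): 40, (1, 0, 1): 10, (1, 1, 0): 35, (1, 1, 1): 15}
sample = {(0, 0, 0): 5, (0, 0, 1): 3, (0, 1, 0): 2, (0, 1, 1): 4,
          (1, 0, 0): 6, (1, 0, 1): 1, (1, 1, 0): 3, (1, 1, 1): 2}

# Axes: 0 = age, 1 = sex, 2 = location.
targets = [((0, 1), margin(truth, (0, 1))), ((1, 2), margin(truth, (1, 2)))]
calibrated = ipf(sample, targets)
print(dict(margin(calibrated, (0, 1))))
```

Because only the listed margins are constrained, the calibrated table matches the target totals on those margins while any higher-order association is inherited from the sample.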

The best model, i.e. the model which reduces the bias between the Twitter 
sample and the total population the most, is the AS, AL model (the best model was 
chosen according to the mean percentage differences, defined below; results for other 
models are not shown). This model calibrates the Twitter population counts so that 
the age-sex and sex-local authority marginal totals are equal to the ‘ground 
truth’ marginal totals. The three-way age-sex-local authority association 
structure, by contrast, differs from that of the ‘ground truth’ data source. The 
AS,SL model can be written as follows: 


\log(\mu_{asl}) = \lambda + \lambda_a^{A} + \lambda_s^{S} + \lambda_l^{L} + \lambda_{as}^{AS} + \lambda_{sl}^{SL} + \log(T_{asl})    (2) 


We denote the Census estimates and the MYE for age group a, sex s, and local 
authority l by C_{asl}, where a denotes the age groups “0-19”, “20-29”, “30-39”, 
“40-49” and “50+”, and s = 1, 2 for males and females respectively. We assume that 
C_{asl} comes from a super population model and has a Poisson distribution with mean 
\mu_{asl}. In this application we focus on the South-East region of England, which 
consists of 67 local authorities, i.e. l = 1, 2, ..., 67. T_{asl} is the ‘offset’ term 
and denotes the count of Twitter users in local authority l who are estimated to be 
in age group a and sex s. The factor \lambda calibrates the Twitter sample to match 
the South-East total population count; \lambda_a^{A} calibrates its age distribution, 
irrespective of sex and location; \lambda_{as}^{AS} calibrates its age-sex 
distribution, irrespective of location; etc. 

In order to ease the interpretation of results, the models are evaluated using 
percentage differences between the Twitter population and the population estimates 
in the ‘ground truth’ data source, defined as follows: 


D_{asl} = 100 \times (P_{asl} - C_{asl}) / C_{asl}    (3) 


where C_{asl} denotes the population counts estimated by the MYE for age group a, sex 
s and local authority l, and P_{asl} the corresponding Twitter-based population 
counts. Figure 2 plots the mean percentage differences by age group and sex for the 
AS,AL model, using the CrowdFlower estimates of age and sex. This figure shows that 
combining the Twitter sample with auxiliary age-sex and age-location association 
structures indeed decreases the bias in the Twitter sample substantially: the mean 
percentage differences decrease to reach the 0-5% range. We conclude that adjusting 
the Twitter sample by both the age group-gender association and the age group-region 
association is needed in order to minimize the mean percentage differences with the 
‘ground truth’ data source. 
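
The evaluation measure can be sketched as follows, assuming the difference is taken as the Twitter-based count minus the ‘ground truth’ count, relative to the latter, and then averaged over local authorities within each age-sex group. The function names and the counts are hypothetical.

```python
from collections import defaultdict


def percentage_differences(calibrated, ground_truth):
    """D = 100 * (P - C) / C for each (age, sex, area) cell."""
    return {
        cell: 100.0 * (calibrated[cell] - count) / count
        for cell, count in ground_truth.items()
        if count != 0 and cell in calibrated
    }


def mean_by_age_sex(differences):
    """Average the cell-level differences over areas, by (age, sex)."""
    groups = defaultdict(list)
    for (age, sex, _area), value in differences.items():
        groups[(age, sex)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Hypothetical counts, for illustration only.
truth = {("20-29", "M", "LA1"): 100, ("20-29", "M", "LA2"): 100}
calibrated = {("20-29", "M", "LA1"): 95, ("20-29", "M", "LA2"): 110}
diffs = percentage_differences(calibrated, truth)
print(mean_by_age_sex(diffs))
```

Under this sign convention a negative mean indicates that the calibrated counts fall below the ‘ground truth’ counts for that group.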

Figure 2 also shows that, overall, our model slightly underestimates the 
populations of both sexes. The age category which is underestimated the most is 
50+, for both sexes. 


Figure 2: Mean percentage difference between the MYE and calibrated models 
based on the Twitter users’ population according to age groups, 2014 CrowdFlower 
5 Conclusions 


This paper proposed a modelling approach based on log-linear models with offsets 
for reducing the selection bias in the Twitter population. The population estimates 
derived from the model considerably reduce the bias between the Twitter population 
and the real population, allowing researchers to make inferences from the 
non-representative Twitter sample to the population of interest. 

Moreover, this contribution has compared the accuracy of the age and 
gender estimates produced by the crowd-sourcing and image-recognition 
approaches. One of the major drawbacks of the Face++ approach is that it takes only 
an image as its input variable. If there is no image available for a user, or if the 
image does not clearly display a human user, the Face++ algorithm fails. In contrast, 
CrowdFlower users are able to utilise the username, tweet content and description as 
well as the image to guess the demographics of the user. Whilst the CrowdFlower 
results are clearly the most accurate, crowd-sourcing is not free and can 
be time-consuming. Face++ is free and comparatively quick, and could thus be 
considered the best approach for gender matching where there is an identifiable user 
in the profile image; however, it is not an effective tool for the measurement of 
age. 
Structured Approaches for High-Dimensional 
Predictive Modeling 


Marco Seabra dos Reis 


Abstract Current predictive analytics approaches are strongly focused on optimizing 
accuracy metrics, leaving little room to incorporate a priori knowledge about the 
processes under analysis and relegating to a secondary concern the interpretation of 
results (Hastie, Tibshirani, & Friedman, 2001; Reis & Saraiva, 2005; Rendall, 
Pereira, & Reis, 2017). However, in the analysis of complex systems, one of the 
main interests is precisely the induction of relevant associations, in order to 
understand or clarify the way systems operate. On the other hand, there is often 
information available regarding the structure of the processes, which could be used 
to benefit the analysis and to enhance the interpretation of results. The importance 
of this issue is not new and has motivated the development of multiblock approaches 
that try to improve the interpretation of results, while maintaining the quality of 
predictions (Naes, Tomic, Afseth, Segtnan, & Mage, 2013; Tenenhaus & Tenenhaus, 
2014; Trygg & Wold, 1998; Westerhuis, Kourti, & MacGregor, 1998). 

In this paper, two classes of multiblock frameworks are addressed that present 
interpretation-oriented features, while allowing some system structure to be 
incorporated. One class is based on the existence of a priori knowledge for building 
the blocks of variables, while the other is able to extract the system structure in a 
data-driven way (Reis, 2013a, 2013b). The introduction of such block structures in 
the predictive platforms constrains their predictive spaces, for the sake of enabling 
interpretable elements in the final model. These constraints do not usually 
compromise the methods’ performance when compared to their unconstrained 
counterparts, and sometimes even lead to improvements in prediction ability, due to 
the use of more parsimonious and robust models. 


Key words: Multiblock methods; Network-Induced Supervised Learning; 
Concatenated PLS; Multiblock PLS; Hierarchical PLS; Sequential Orthogonalised 
PLS 
1 Introduction 


Modeling is a fundamental piece of most workflows for process improvement, 
monitoring and control. With the increasing availability of data in the Big Data era 
and with the emergence of Industry 4.0, there is a strong emphasis on developing 
data-driven modeling approaches to address these tasks. In the domain of data-driven 
predictive frameworks, the mainstream methods tend to treat all variables, a priori, 
in the same way, and the different methodologies available provide different 
solutions to the way variables are selected and/or combined in order to compose the 
final model. We call these methods “single-block”, as they treat all regressors 
equally in a first stage. 


Taking a closer look at the recurrent structures found in data and at the way 
systems generating them actually work, one can notice that, most often, not all 
variables are actively contributing to the phenomenon under study (as assumed in 
multivariate methods) nor are they bringing independent and isolated pieces of 
information to the model (for which variable selection schemes would be adequate). 
Rather, the prevailing structure seems to take the form of clusters of variables 
(modules), where each cluster relates to a given functional mechanism. Variables 
falling in the same cluster exhibit some degree of mutual correlation, and clusters 
may be aggregated hierarchically into a super level, at a higher level of 
abstraction (Clauset, Moore, & Newman, 2008; Guimerà & 
Amaral, 2005; Newman, 2006; Ravasz, Somera, Mongru, Oltvai, & Barabasi, 2002). 


In this context, instead of multivariate or variable selection schemes for handling 
high-dimensional systems, methods should enable the definition and selection of 
modules of variables that better reflect the system structure, which can then be 
selected for integrating the model according to their predictive power (Reis, 2013a, 
2013b). This setting calls for methods that are able to handle simultaneously several 
heterogeneous blocks of variables in their formulation, hereafter called multiblock 
methods. 


Two classes of multiblock methods can be identified upon a close analysis of the 
available literature. On one hand, there are methods where the blocks of variables 
are defined based on a priori knowledge about the system’s structure, e.g., when 
variables regard different process units, arise from different analytical measurement 
sources or are related to distinct and identifiable functional modules of the system. 
On the other hand, one can find methods where such knowledge is not explicitly 
available, but a modular structure is believed to exist, which must be inferred and 
extracted from the available data. 


It is the purpose of this work to provide a brief overview of both classes of 
multiblock methods and to illustrate their application using a real-world case 
study. This article is organized as follows. In the next section, the main 
representatives of the two classes of methods, which are also used in the present 
work, are briefly presented. Then, in Section 3, the results obtained are presented 
and discussed. Section 4 concludes this paper, with an overview of its contents and 
some concluding remarks about the relevance of considering multiblock methods in 
the analysis of high-dimensional systems. 


2 Multiblock methods for predictive data analysis 


In this section, a short presentation of the methods addressed in this work is 
provided. For more details on the implementation and use of these methods, we refer 
the interested readers to the references cited. 


2.1 Class 1: Composition of the blocks is known a priori 


This class comprises all multiblock methods that assume the composition of the 
different blocks to be known a priori, i.e., the following mapping can be established 
using background knowledge: variable_i → block_j. This assignment can often be made 
when there is sufficient knowledge about the system and the way 
variables are naturally organized regarding how they contribute to the final outcome. 
Several multi- and megavariate methods fall in this category, and we will address the 
mainstream ones, namely Concatenated PLS (CPLS), Hierarchical PLS (HPLS) and 
Multiblock PLS (MBPLS), as well as recent advances in this field, such as Sequential 
Orthogonalised PLS (SO-PLS). 


In brief terms, Concatenated PLS (CPLS) consists of concatenating all blocks of 
variables in a single augmented matrix and performing the classical PLS method over 
the entire data array. The different blocks should be weighted before being used in 
the model, in order to give equal importance to all blocks or to increase or decrease 
the importance of a given block in the model. Typically, two block-scaling methods are 
described in the literature: soft block scaling and hard block scaling (Eriksson et al., 
2006). 
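
The block-weighting step can be illustrated with a minimal sketch. It assumes one common convention for the two scalings, which is not spelled out in the text: each variable is first autoscaled, then divided by the square root of the number of variables in its block (hard scaling, so every block has total variance 1) or by the fourth root (soft scaling). The function names and data are hypothetical, and the new scaling methodology of Campos et al. (2017) is not reproduced here.

```python
from statistics import mean, stdev, variance


def autoscale(column):
    """Centre a variable and scale it to unit sample variance."""
    m, s = mean(column), stdev(column)  # a constant column would give s == 0
    return [(x - m) / s for x in column]


def block_scale(block, mode="hard"):
    """Autoscale each variable, then down-weight by the block size K.

    block : list of columns (one list of values per variable)
    mode  : "hard" divides by K ** 0.5; "soft" divides by K ** 0.25
    """
    exponent = {"hard": 0.5, "soft": 0.25}[mode]
    weight = len(block) ** exponent
    return [[x / weight for x in autoscale(col)] for col in block]


def concatenate(blocks, mode="hard"):
    """Return the augmented matrix as one list of scaled columns."""
    return [col for block in blocks for col in block_scale(block, mode)]


# Hypothetical blocks: 4 samples, a block of 4 variables and a block of 1.
block_a = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0],
           [0.0, 5.0, 1.0, 2.0], [3.0, 3.0, 1.0, 0.0]]
block_b = [[10.0, 12.0, 9.0, 15.0]]
augmented = concatenate([block_a, block_b], mode="hard")
print(len(augmented), sum(variance(col) for col in augmented))
```

With hard scaling, the large and the small block contribute the same total variance to the augmented matrix; soft scaling leaves larger blocks with somewhat more weight.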


In Hierarchical PLS (HPLS), each data block is considered as a separate source 
of information and the multiblock model extracts the common structure for all the 
different blocks. This common structure forms the so-called super level of the model, 
combining information from all blocks of predictors at the lower levels. This means 
that block scores, loadings and weights for each separate block are available for 
interpretation in the lower level and super scores, loadings and weights are available 
in the super level for the interpretation of the global model. 


Multiblock PLS (MBPLS) was proposed by Wold et al. (Wold, Martens, & 
Wold, 1983) and later by Wangen and Kowalski (Wangen & Kowalski, 1988). 
Similarly to the HPLS method, this method also presents two levels: the super level 
with global information and the lower level with information from each block. The 
main difference between this method and HPLS is that the Y block is regressed on 
all descriptor X blocks, whereas in HPLS the Y block is only regressed on the super 
block, which means that the block scores are calculated in an unsupervised way. This 
causes the block scores to be different in the two methods. 


Sequential Orthogonalised PLS (SO-PLS) was proposed by Naes et al. 
(2013) and is a methodology that incorporates the several blocks of variables in the 
model, one at a time, while evaluating/interpreting the incremental or additional 
contribution of the different blocks for improving the model predictions. This 
capability is relevant when one wants to assess the gain of introducing an additional 
source of information. The sequential nature of SO-PLS implies that the order 
chosen for including the blocks in the model can influence the final result. 


2.2 Class 2: Composition of the blocks is unknown 


When the composition of the blocks is unknown, it must be induced from data. 
The Network-Induced Supervised Learning regression method (NI-SL) was 
proposed by Reis (Reis, 2013a, 2013b), aiming at bringing interpretation features to 
the forefront of the analysis goals. The method is divided into two stages. Stage 1 
(Network-Induced Clustering) aims at finding functionally related groups of 
variables (clusters), which will form meaningful X blocks with predictive power for 
Y. The second stage consists of developing a predictive model, based on variates 
computed for the blocks induced in the first stage. To this end, classical PLS models 
are developed separately between each X block and the Y response, and a predefined 
number of latent variables is retrieved from each block (in the present study, five 
latent variables were retrieved from each block). These latent variables are gathered 
into a super block, and a forward stepwise regression procedure is used to select the 
subgroup of latent variables that leads to the best fit. 


NI-SL can also belong to Class 1 if the cluster compositions are known 
beforehand, in which case the NI-clustering stage is bypassed and the method starts 
immediately in the second stage of modelling. This is the case for the example 
addressed in this work, which will be described in the next section. 


3 Results 


We illustrate the application of multiblock methodologies using an example 
from the wine production industry. More specifically, this example is focused on the 
prediction of ageing time in Madeira wine, based on different analytical 
measurement sources: the volatile profile (1st block), the polyphenols and two furanic 
compounds (2nd block), the organic acids quantification (3rd block) and the 
ultraviolet-visible spectra (4th block). The volatile profile was analysed by gas 
chromatography coupled to mass spectrometry (GC-MS), preceded by solid phase 
extraction; the second block of data was obtained by High-Performance Liquid 
Chromatography combined with Photodiode Array Detection (HPLC-DAD; direct 
injection); organic acids (the 3rd block of variables) were also quantified by Liquid 
Chromatography combined with Photodiode Array Detection; UV-Vis absorbance 
spectra (4th block of variables) were acquired with a Perkin-Elmer Lambda 2 
spectrophotometer (Waltham, MA, USA). More information about this case study 
and results is available elsewhere (Campos, Sousa, Pereira, & Reis, 2017). 


A total of 26 samples were analysed, covering a range of 20 years (2-3 wine 
samples were taken per ageing year, at 2-year intervals). All samples correspond 
to wines produced from the same grape variety (Malvasia) and were supplied by 
the same Madeira wine producer. 


In this paper, due to space constraints, the analysis will be focused on the 
predictive capabilities of the methods, which were assessed according to the 
procedure described next. For each multiblock algorithm, 50 models were estimated 
using the datasets described above and Monte Carlo assignment of samples. In each 
Monte Carlo assignment, the dataset is randomly divided into a test set (20%) and a 
training set (80%). The training set is used to calibrate the model and to determine 
the respective hyper-parameters by 10-fold cross-validation. This 
procedure is repeated 50 times, originating 50 models for each multiblock method. 
The test sets are used for prediction, from which one computes the root mean 
square error of prediction (RMSEP) for each Monte Carlo run and each method, 
using Equation (1). 


RMSEP = \sqrt{ \frac{ \sum_{i=1}^{n_{test}} (y_{pred,i} - y_{obs,i})^2 }{ n_{test} } }    (1) 
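
The Monte Carlo evaluation described above can be sketched as follows. This is a simplified illustration, not the procedure used to obtain the reported results: a one-variable least-squares line stands in for the multiblock models, the inner 10-fold cross-validation for hyper-parameter selection is omitted, and the data and function names are hypothetical.

```python
import math
import random


def rmsep(y_pred, y_obs):
    """Root mean square error of prediction over one test set."""
    n_test = len(y_obs)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(y_pred, y_obs)) / n_test)


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (stand-in for a PLS model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b1 * mx, b1


def monte_carlo_rmsep(x, y, n_runs=50, test_fraction=0.2, seed=0):
    """Repeat random train/test splits and collect one RMSEP per run."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(x)))
    errors = []
    for _ in range(n_runs):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [b0 + b1 * x[i] for i in test]
        errors.append(rmsep(preds, [y[i] for i in test]))
    return errors


# Hypothetical data: 26 samples with a noisy linear response.
noise = random.Random(1)
x = [float(i) for i in range(26)]
y = [2.0 + 0.8 * xi + noise.gauss(0.0, 1.0) for xi in x]
errors = monte_carlo_rmsep(x, y)
print(len(errors), sum(errors) / len(errors))
```

Keeping the list of per-run errors, rather than only its mean, is what allows the spread of the RMSEP to be summarised and different methods to be compared on the same splits.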


The distribution of the RMSEP over the 50 trials characterizes the method 
performance in terms of prediction accuracy and robustness. Moreover, it can be used 
to compare different methods by evaluating the statistical difference in the prediction 
errors they obtain under similar testing conditions. 


Table 1 presents the mean RMSEPs obtained for the several multiblock methods 
studied in this article. The methods leading to the best performance are CPLS (with a 
new scaling methodology developed by the authors; see Campos et al., 2017) and 
SO-PLS, followed by NI-SL. The first two methods outperform the best linear 
multivariate methodologies applied to each block separately (see the results for 
Principal Component Regression (PCR) and Partial Least Squares (PLS) in Table 2; 
Rendall et al., 2017). These results indicate that it is 
possible to synergistically combine different sources of information for improving 
the predictive performance of the methods, even though the single-block methods 
based on the Polyphenol Content already lead to interesting predictive results. 
Moreover, the multiblock methodologies bring other interpretational dimensions to 
the analysis, namely regarding the importance of the different blocks for predicting 
the response and their redundancy or overlap, which are not addressed in this short 
article. 


Table 1. Average root mean square error of prediction (RMSEP) and the respective 
interquartile range (IQR) obtained in the Monte Carlo Cross-Validation procedure for 
the different multiblock methods tested in this paper. 


Method                                             RMSEP   IQR (75%-25%) 
Concatenated PLS (CPLS)                            0.93    0.61 
Hierarchical PLS (HPLS)                            1.48    0.44 
Multiblock PLS (MBPLS) - Block Scores deflation    1.36    0.58 
Multiblock PLS (MBPLS) - Super Scores deflation    1.34    0.54 
Network Induced Supervised Learning (NI-SL)        1.17    0.85 
Sequential Orthogonalised PLS (SO-PLS)             0.97    0.49 


Table 2. Average root mean square error of prediction and the respective interquartile 
range (IQR) obtained in the Monte Carlo Cross-Validation for single block approaches. 


Chemical Data           Method   RMSEP 
Polyphenol Content      PCR      1.18 
                        PLS      1.17 
Volatile Composition    PCR      [illegible] 
                        PLS      1.43 
UV-Vis                  PCR      2.23 
                        PLS      2.86 
Organic Acids           PCR      2.93 
                        PLS      2.86 


4 Conclusions 


In this work, we illustrate the potential of using multiblock predictive methods in 
datasets composed of natural blocks of variables. The case study illustrates the 
advantage of using all blocks of variables simultaneously, in a structured way, rather 
than in an isolated fashion. 


Even though multiblock methods represent constrained versions of their single-block 
counterparts, their predictive ability need not be inferior. On the contrary, it was 
often found to be superior, which is due to their more parsimonious nature, leading 
to more stable parameter estimation and ultimately to more accurate predictions (Reis, 
2013a, 2013b). If, on top of this, one considers the expected higher interpretability 
of the multiblock methods, it is easy to understand the increasing interest in 
adopting this modeling formalism to address the analysis of data collected from 
complex processes and phenomena. 